Mentzer__Stuart
Beginner

OpenMP task performance issues

Hello,

I am seeing some surprising performance with OpenMP task support with Intel C++ 19.0 Update 5 that I don't get with GCC 9.2. In the demo app below I expect the in-loop taskwait or the alternative taskgroup to cause it to have a single thread load and run about the same speed as the serial application. GCC gives this but Intel C++ gives 100% CPU load and a 1.8x slowdown. More importantly for our real application, we get the same slowdowns instead of speedups using a set of tasks within a taskgroup or followed by a taskwait.

// Demo for Intel C++ 19.0 Update 5 OpenMP performance issues

// Serial speed of Intel C++ is ~3.3x slower than GCC 9.2.0

// With only taskwait after while loop on quad-core Haswell CPU:
//  Intel C++: 2.8x speedup
//  GCC: 3.5x speedup
//  Both use 100% CPU as expected

// With taskwait in while loop:
//  Intel C++: 100% CPU usage and a 1.8x slowdown
//  GCC: 1 CPU/thread used and no slowdown as expected
//  This taskwait is not needed here but the same issue is seen in the real application with multiple tasks followed by a taskwait
//  Same behavior seen with a taskgroup around the one task instead of this taskwait

// icl /Qstd=c++11 /DNOMINMAX /DWIN32_LEAN_AND_MEAN /DNDEBUG /Qopenmp /O3

#include <atomic>
#include <cstddef>
#include <iostream>
#include <omp.h>

int
main()
{
	#pragma omp parallel
	{
		#pragma omp single
		{
			bool run( true );
			std::size_t i( 0 );
			std::atomic_size_t sum( 0u );
			double const wall_time_beg( omp_get_wtime() );
			while ( run ) {
//				#pragma omp taskgroup // Same behavior as the in-loop taskwait
				{
				#pragma omp task shared(sum)
				{
					std::size_t loc( 0u );
					for ( std::size_t k = 0u; k < 2000000000u; ++k ) loc += k/2;
					sum += loc;
				} // omp task
				} // omp taskgroup
				#pragma omp taskwait // GCC gives expected 1 CPU/thread usage: Intel C++ gives 100% CPU and 1.8x slowdown!
				if ( ++i > 50u ) run = false;
			}
			#pragma omp taskwait
			std::cout << "sum = " << sum << ' ' << omp_get_wtime() - wall_time_beg << ' ' << i << std::endl;
		} // omp single
	} // omp parallel
}

Can anyone shed light on this? Looks like a buggy task implementation but maybe there is more to it.

Thanks,
Stuart

20 Replies
jimdempseyatthecove
Black Belt

The likely problem is a coding error and/or assumption on your part.

1) To follow your code as written, line 37 should use a reduction on sum (not private)
2) Without taskgroup on line 35, taskwait on 44 is misused
3) With taskgroup on line 35, the enclosed group region only invokes one task (iow you are not using taskgroup for nested tasking)

As to why the difference, my guess is that GCC, for each iteration of the while ( run ) loop, assigns the omp task to the same logical processor, whereas the Intel version arbitrarily picks a thread and, due to KMP_BLOCKTIME > 0, this happens to be .NOT. the same thread.

You need to rework your test program to something more reasonable.

Jim Dempsey

Mentzer__Stuart
Beginner

Thanks for the response Jim, but I don't think your comments are valid or resolve the issue. I am using just one task on purpose, to be sure that the tasks aren't running in parallel, in order to reveal the problem with Intel C++ running at 100% CPU.

1) The value of sum doesn't matter to the demo/issue. Tasks don't support a reduction clause, so I believe using shared with an atomic accumulation should work. I used std::atomic for sum, which it turns out isn't strictly supported by OpenMP, but changing it to a size_t and adding a #pragma omp atomic before the accumulation gives the same result and the same poor behavior.

2) I don't understand this comment. As written the taskgroup is commented out and putting a taskwait after one or a set of tasks is, I believe, perfectly fine/normal. When I enabled the taskgroup alternative I commented out the taskwait. Yes, normally you would have a taskwait after multiple tasks but it should be fine to use it this way just to demonstrate the performance issue. Adding a second task before the taskwait still uses 100% CPU instead of the expected ~2 threads that GCC uses.

3) Yes, for demo purposes I'm only showing one task. That should not make this use of taskgroup invalid. I just wanted to see if taskgroup gave the same performance problem as taskwait (and it does). You can put a second task in there and the same behavior/issue is still present. GCC uses ~2 threads and completes the 2-task demo 6.5x faster than Intel C++, which uses 100% CPU.

Even if Intel C++ is picking a different thread on each pass, with the taskwait or taskgroup used to prevent running the tasks in parallel it should not cause 100% usage of 8 CPUs for the entire run. The behavior makes no sense and looks buggy.

This code is a minimal demo, not application code, and it seems to show that something is wrong with the Intel C++ task implementation. I don't see anything in your comments that would lead to a better demo or that explains what is going on here.

Stuart

jimdempseyatthecove
Black Belt

1) >> tasks don't support a reduction clause so I believe using shared and an atomic accumulation should work.
Then in your production code, I suggest you consider producing your own reduction code. IOW use a thread-local value, e.g. mySumDelta += loc in the loop, then sum += mySumDelta outside the loop.

2) >> putting a taskwait after one or a set of tasks is, I believe, perfectly fine/normal.
Without taskgroup, the test encounters 51 taskwaits, not one taskwait.
With taskgroup, each of your "groups" invokes a single task. The intent of a taskgroup is for the group to invoke multiple tasks (similar to nested parallelism in pre-task versions of OpenMP), and for a taskwait within a taskgroup to wait for all tasks of the current group.
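As an illustration of that intent (a hedged sketch, not Stuart's code; `run_group` and the task bodies are hypothetical), a taskgroup normally encloses multiple tasks, and the end of the group region waits for all of them and their descendants:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch: a taskgroup that spawns several tasks. The end of
// the taskgroup region waits for every task generated inside it (and any
// descendants). Compiled without OpenMP the pragmas are ignored and the
// function still returns the same result serially.
std::size_t run_group(int ntasks)
{
    std::size_t done(0u);
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp taskgroup
        {
            for (int t = 0; t < ntasks; ++t) {
                #pragma omp task shared(done)
                {
                    #pragma omp atomic
                    ++done;                  // each task records completion
                }
            }
        } // implicit wait: all ntasks child tasks have finished here
    }
    return done;
}
```

With a single task inside the group, as in the demo, the implicit wait at the group's end behaves the same as a taskwait placed after the task.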

3) You have a misconception. Taskwait is not equivalent to pthread join; IOW the thread is not terminated. Threads are created the first time through !$OMP PARALLEL (and optionally at the first nested !$OMP PARALLEL, per thread at that level, and optionally at taskgroup, per thread at the current taskgroup nest level).
When a parallel loop exits, or a parallel region exits, or a taskwait (in a taskgroup or not) is reached, the thread is not terminated .AND. the thread is not immediately placed on a condition wait. Instead (implementation dependent) the thread enters a spin-wait, IOW a compute loop looking for the next thing to work on. Only after a period of time (100-300ms) does the thread place itself in a condition-wait state. If during this spin-wait time the thread finds work to do, it competes with other threads to take hold of the work.
So what, you might ask.

Well, the OpenMP task statement provides for the task to be taken by any thread (available to the evoking task): a) an available thread in spin-wait, b) a sleeping thread that is awakened and competes with the available thread(s) to take the task, or c) lacking a) or b), when permitted to spawn a new thread, a newly spawned thread that competes with the available thread(s) to take the task. The maximum length of time spent in spin-wait is controllable.

With this knowledge at hand, you should be able to see that there are two methods for task-to-thread assignment:

1) Deterministic: same thread assigned in same sequence of task invocation at main level and (nested) task group level.
2) Opportunistic: the first acquiring thread (available to the evoking task): a) an available thread in spin-wait, b) a sleeping thread awakened to compete with the available thread(s), or c) lacking a) or b), when permitted, a newly spawned thread competing with the available thread(s) to take the task.

And should your test be constructed to permit all hardware threads to be available to the test program, BUT where the test region utilizes less than the full complement of hardware threads, scheduling method 1) would show system utilization of only the number of threads in the test region (the same threads on each entry into the test region), whereas method 2) may show system utilization of more or all threads.

Method 2 tends to have lower latencies for the application at the expense of more CPU time consumed from the system (and other applications). Should the excess CPU time in spinwait be of concern, then you can either restrict your application to fewer threads or specify smaller (or 0) spinwait time (KMP_BLOCKTIME or omp_.... equivalent).

Method 1 tends to favor other applications running on your system over latencies within your application.

Jim Dempsey

jimdempseyatthecove
Black Belt

I might add that a proper test (IMHO) should identify first time overhead in addition to subsequent overhead.

for this, add

main()
{
for(int rep=0; rep<3; rep++)
{
your test code here
}
} // main

Then observe the execution time for each repetition of the test region.

Jim Dempsey

 

Mentzer__Stuart
Beginner

Thanks again, Jim. I appreciate the input but I think we are off track here.

1) The code as shown IS a correct reduction for sum and computes the correct sum. loc IS a thread-local value and it DOES accumulate into sum outside the for loop. Try it. Anyway, this sum code is only there to give the task some work to do that isn't optimized away.

2) I am showing the (commented out) taskgroup as an alternative to the in-while-loop taskwait. The intent is for both of them to PREVENT parallel execution of the single task to demonstrate that Intel C++ is running 100% CPU when it shouldn't be (and GCC doesn't). I WANT IT to do 51 task waits. The effect of taskgroup OR that taskwait is the same here, as it should be. Since there are no nested tasks they both behave the same, as they should. Here is a documentation excerpt that explains that:

You can synchronize tasks by using the taskwait or taskgroup directives.

When a thread encounters a taskwait construct, the current task is suspended until all child tasks that it generated before the taskwait region complete execution.

When a thread encounters a taskgroup construct, it commences to execute the taskgroup region. At the end of the taskgroup region, the current task is suspended until all child tasks that it generated in the taskgroup region and all of their descendant tasks complete execution.

3) Nowhere do I imply that I think taskwait terminates a thread -- not sure where you got that. I understand that it suspends all tasks until they reach that point. In this case the purpose is to prevent the parallel task execution that you would normally want, so that I can understand why Intel C++ is having huge performance problems with the use of tasks in an actual application. The taskwait is meant to prevent the next task from being started until the previous one completes, which should only allow one task to run at a time. This is of course silly in a real application and is only done here because the task performance was terrible. And indeed it shows a problem with the Intel C++ behavior.

I appreciate the input but am not sure you are following what I am trying to show here. This is not a snippet of application code: this is an intentionally artificial demo to show the strange misbehavior of Intel C++. Specifically, there should only be one task running at a time yet Intel C++ is fully loading up the CPUs. Run this with GCC and Intel C++ and watch your CPU loading then maybe it will be more clear.

 

Intel Support: Please run this demo with Intel C++ and GCC and see if there is a bug here. I am pretty sure there is.

Viet_H_Intel
Moderator

With our new -qnextgen compiler option, which uses LLVM technology, I saw a better result than GCC (8.1):

$ rm a.out && icpc -std=c++11  -fopenmp -O3  t1.cpp &&  ./a.out
sum = 14106511801580896768 58.836 51
$ rm a.out && g++ -std=c++11  -fopenmp -O3  t1.cpp &&  ./a.out
sum = 14106511801580896768 34.8278 51
$ rm a.out&&  icpc -std=c++11  -fopenmp -O3  t1.cpp -qnextgen &&  ./a.out
sum = 14106511801580896768 26.4163 51
$ gcc -v
gcc version 8.1.0 (GCC)
$ icpc -V
Intel(R) C++ Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 19.0.5.281 Build 20190815
 

jimdempseyatthecove
Black Belt

RE: 1

While loc is on stack of current thread at point of

   sum += loc;

The variable sum is a shared (atomic) variable on the stack of the thread that acquired the single region.

While sum, being atomic, is thread-safe (correct) for the += operation, it is NOT thread-efficient, as it requires a LOCKed operation (usually XADD in this case). Note that the sum += loc occurs within a task region within the single region. While the single region has one "master" (the arbitrary thread that acquired the single region), any available thread can acquire the enqueued task (in this example, all threads could execute any of the tasks). A += on an atomic variable may cost ~200x that of a += on a local variable.
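A minimal sketch of the cost point (hypothetical helper names; both functions compute the same sum, but the first pays a LOCKed add on every iteration while the second pays one per call):

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// One LOCKed read-modify-write (e.g. XADD) per iteration: thread-safe but
// far more expensive per += than a plain local add.
std::size_t atomic_per_iter(std::size_t n)
{
    std::atomic<std::size_t> sum(0u);
    for (std::size_t k = 0u; k < n; ++k) sum += k / 2;
    return sum.load();
}

// Accumulate into a plain local first and touch the atomic once -- the
// pattern the demo code already uses (loc, then sum += loc).
std::size_t atomic_once(std::size_t n)
{
    std::atomic<std::size_t> sum(0u);
    std::size_t loc(0u);                 // task-local partial sum
    for (std::size_t k = 0u; k < n; ++k) loc += k / 2;
    sum += loc;                          // single LOCKed add
    return sum.load();
}
```

Since the demo's atomic add is outside the for loop, it is already in the cheap second form; the locked operation happens once per task, not once per iteration.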

RE 2: parallel execution of the single task to demonstrate that Intel C++ is running 100% CPU when it shouldn't be (and GCC doesn't).

Not true. While (in the sample code) a single thread will execute each task (task-serial), the Intel version is (likely) running each task on a different logical processor, whereas the GCC version (apparently) is running the "next" task on the same logical processor. IOW, in the Intel version, upon completion of the taskwait some other thread acquires the next task while the just-finished thread enters the 300ms spin-wait.

cpu
1  |Task1|--300ms--|zzz|TaskN|...
2        |Task2|200ms|Task4|--300ms--|...
5              |Task3|--300ms--|
10                         |Task5|--300ms--|
...

Note, while the tasks run sequentially, the specific thread used may (generally) differ, .AND. each of these threads consumes some or all of its KMP_BLOCKTIME in a compute loop looking for work.

Re-read the description of deterministic versus opportunistic scheduling of tasks.

>>3) ... I understand that it suspends all tasks until they reach that point.

While the tasks are serialized by your placement of taskwait, the thread completing the task remains in a compute state competing with all other threads for the next task. It will remain in a compute state until the earlier of: obtaining a task, or expiring the block time.

>>Specifically, there should only be one task running at a time

Then it is your responsibility to ask for this behavior. Either set the Intel Specific environment variable

     KMP_BLOCKTIME=0

or use the Intel specific OpenMP runtime function kmp_set_blocktime(int ms)

or (if available) OpenMP V4.5 and later environment variable

    OMP_WAIT_POLICY=PASSIVE

(Intel's default is OMP_WAIT_POLICY=ACTIVE with spin-wait max time set by KMP_BLOCKTIME)
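In shell terms (a sketch; Linux/bash syntax shown, with the Windows cmd equivalent in a comment, and the program name hypothetical):

```shell
# Suppress the spin-wait before launching the program.
export KMP_BLOCKTIME=0          # Intel-specific: spin 0 ms before sleeping
export OMP_WAIT_POLICY=PASSIVE  # standard OpenMP 4.5 equivalent
# On Windows cmd:  set KMP_BLOCKTIME=0
# ./your_app                    # hypothetical program name
```

Either variable alone should be enough with the Intel runtime; setting both makes the intent unambiguous.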

Jim Dempsey

jimdempseyatthecove
Black Belt

Now, why it may be bad to set KMP_BLOCKTIME=0 or OMP_WAIT_POLICY=PASSIVE

At issue here is that, at the point of taskwait, should there be no immediate next task for the just-completing thread, it will immediately take the long trip to suspend itself, and then take a similarly long time to wake up when a task becomes available (and is not taken by another thread).

While this behavior is ideal for your simple test program above, it may be detrimental to your application, especially when your application repeatedly loops through entering and exiting a parallel region (or the same taskgroup within a parallel region) or other quirky task enqueueing scenarios. By burning some CPU time, you can greatly reduce the task startup latency. It is the programmer's choice when using runtime library support routines, and/or the system manager's when using environment variables.

Jim Dempsey

Mentzer__Stuart
Beginner

Viet Hoang,

Thanks for the tip about /Qnextgen. I tried that and it is >2X faster across the board!

I think there are two things going on here.

First, Intel C++ on Windows is much slower than GCC for OpenMP tasks. The /Qnextgen narrows the gap considerably but there is still a gap that is worth looking at. The difference in real parallel usage can be seen with the updated demo below by commenting out the taskwait.

Second, Intel C++ appears to be fully spinning the waiting threads in this contrived taskwait demo, which I don't see with GCC and didn't expect. This is what Jim has been hinting at. I now don't think this is relevant to the performance issue I was trying to track down in our actual application code. But since I can't change this behavior with OMP_WAIT_POLICY=PASSIVE and/or KMP_BLOCKTIME=0, this may still be an issue with the compiler.

As far as the GCC performance advantage, apart from the waiting threads not spinning, it might just be doing better at optimizing the simple work loop here, but it seems worth looking into.

For the record here is an updated demo code that correctly shows the total CPU time on Windows and eliminates some possible sources of confusion.

// Demo for Intel C++ 19.0 Update 5 OpenMP Windows performance issues

// Serial speed:
//  Intel C++: ~3.4X slower than GCC 9.2.0
//  Intel C++ /Qnextgen: ~1.5X slower than GCC 9.2.0

// Parallel (no taskwait) wall clock speed on quad-core Haswell CPU:
//  Intel C++: ~2.8X speedup
//  Intel C++ /Qnextgen: ~2.9X speedup
//  GCC: ~3.5x speedup
//  All use 100% CPU as expected

// With taskwait to prevent parallel execution:
//  Intel C++: ~2X slowdown and 100% CPU
//  Intel C++ /Qnextgen: ~1.7X slowdown and 100% CPU
//  GCC: No slowdown and 1 CPU/thread used as expected
//  This shows a spin wait behavior: OMP_WAIT_POLICY=PASSIVE and KMP_BLOCKTIME=0 had no effect

// icl /Qstd=c++11 /DNOMINMAX /DWIN32_LEAN_AND_MEAN /DNDEBUG /Qopenmp /QxHOST /O3
// icl /Qnextgen /DNOMINMAX /DWIN32_LEAN_AND_MEAN /DNDEBUG /Qopenmp /QxHOST /O3

#include <cstddef>
#include <iostream>
#include <omp.h>
#ifdef _WIN32
#include <windows.h>
#else
#include <ctime>
#endif

double
get_cpu_time()
{
#ifdef _WIN32
	FILETIME a, b, c, d;
	if ( GetProcessTimes( GetCurrentProcess(), &a, &b, &c, &d ) != 0 ) { // OK
		return (double)( d.dwLowDateTime | ( (unsigned long long)d.dwHighDateTime << 32 ) ) * 0.0000001;
	} else { // Error
		return 0.0;
	}
#else // Posix
	return double( std::clock() ) / CLOCKS_PER_SEC;
#endif
}

int
main()
{
	bool run( true );
	std::size_t i( 0u );
	std::size_t sum( 0u );
	double const cpu_time_beg( get_cpu_time() );
	double const wall_time_beg( omp_get_wtime() );
	#pragma omp parallel
	{
		#pragma omp single
		{
			while ( run ) {
				#pragma omp task shared(sum)
				{
					std::size_t loc( 0u );
					for ( std::size_t k = 0u; k < 2000000000u; ++k ) loc += k/2;
					#pragma omp atomic
					sum += loc;
				} // omp task
				#pragma omp taskwait // GCC gives expected 1 CPU/thread usage: Intel C++ gives 100% CPU and 1.8x slowdown!
				if ( ++i >= 50u ) run = false;
			} // while
		} // omp single
	} // omp parallel
	double const cpu_time( get_cpu_time() - cpu_time_beg );
	double const wall_time( omp_get_wtime() - wall_time_beg );
	std::cout << "Sum: " << sum << std::endl;
	std::cout << "Simulation CPUs time: " << cpu_time << " s" << std::endl;
	std::cout << "Simulation wall time: " << wall_time << " s" << std::endl;
}
Mentzer__Stuart
Beginner

Jim,

Thanks for the additional info. More of the situation and your comments have become clear after further research and testing.

1) Although the efficiency of the reduction here is not the point of this demo, it would be useful to know a way to do that type of operation more efficiently in tasks (at least until task reduction support is widespread). I looked around for a prescriptive method, and the most promising seemed to be making the local accumulator threadprivate and doing the global accumulation outside of the task and single blocks, but this was actually a bit slower. If you have demo code for this, I'm sure it would be of interest to others as well as myself.

2) I now understand that my contrived anti-parallel parallel demo is causing spin waits on the idle threads whereas GCC doesn't do that. Your prior comments now make sense in that context. While it isn't very relevant to my actual application (whereas the parallel task slowness relative to GCC is) it is still good to understand that a wait directive may not release threads to work on other tasks. But, as I noted above, the environment variables that should suppress the idle thread spinning aren't working. If there is a reason for this or a work-around that would be useful info.

Thanks,
Stuart

jimdempseyatthecove
Black Belt

>> 1)

I misread your earlier post and thought the atomic += was inside the for loop. Being outside the loop, it should be fine to reduce this way.

>> wait directive may not release threads to work on other tasks

This is NOT what I said. The taskwait directive DOES release the thread to work on other task(s). During the spin-wait time it continually attempts to obtain a waiting task and, if successful, immediately runs it. It happens in your test program that, with Intel, some other thread (in spin-wait) obtains the next task before the thread that just completed a task can obtain it. Should the spin-wait time expire before a task is found, the thread falls into a condition wait (Linux) or a WaitFor event (Windows).

Jim Dempsey

 

jimdempseyatthecove
Black Belt

Crude hypothetical chart of your test program using Intel:

cpu     time->    q=task, a=add thread, s=start, e=execute, t=taskwait, c=condition wait, ...=spinwait
0	aaaaaaa.........................................cccccccccccccccc{and so on}
1        s.................eeeeeeeeeeet.................................{and so on}
2         s.........................................cccccccccccccccccccc{and so on}
3          s.q............q............q............q............q......{and so on}
4           s.eeeeeeeeeeet...........................eeeeeeeeeeet.......{and so on}
5            s..........................................cccccccccccccccc{and so on}
6             s..........................................ccccccccccccccc{and so on}
7              s........................eeeeeeeeeeet..............eeeeee{and so on}

The main thread (0) has the initial overhead of establishing the OpenMP thread pool (c)
Thread 3 happens to acquire the single and enqueues tasks (in the above, doesn't take a task)
Thread 4 took the first task
Thread 3 (and all other threads) in spinwait during execution of first task
On taskwait by thread 4, the still spinwaiting thread 3 immediately observes the completion of the task and enqueues next task
Thread 1 (still in initial spinwait) happens to take task _before_ thread 4 takes next task
...

Jim Dempsey
 

Mentzer__Stuart
Beginner

Jim,

OK, glad we resolved the reduction issue.

I get that it should release a wait-spinning thread to another task. My broader issues are:

  • Why is the Intel C++ performance much worse than GCC (serial and parallel)? The /Qnextgen option should be an improvement but it currently dies trying to build our real application, so that is hypothetical at this point.
  • Why don't the environment variables prevent wait spinning?
  • Why do the waiting threads appear to spin continuously for the entire minute+ run? I guess if the scheduler keeps bumping the one task to different threads that could prevent any of them from sleeping but that seems like a bad design and I thought that tasks defaulted to "tied".
jimdempseyatthecove
Black Belt

>>Why is the Intel C++ performance much worse than GCC (serial and parallel)?

I am unable to test this on v19u5. You should be able to run VTune and examine the differences. Look at the call stack differences in addition to hot spot.

>>The /Qnextgen should be an improvement but it dies trying to build our real application yet so that is hypothetical at this point.

I cannot test that option. It must be an undocumented option (now disclosed) for beta testing new compiler optimizations.

>>Why don't the environment variables prevent wait spinning?

They should. Insert, as the first statement in main(), code to get the intended environment variable; print exactly what you requested and what was returned (if anything), or print error information on failure. Note, on Linux you may choose to use getenv, and on Windows GetEnvironmentVariable. Customize the error response to whichever call you use.
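A sketch of that check (hedged; `report_env` is a hypothetical helper using portable std::getenv, while GetEnvironmentVariable would be the Windows-native route Jim mentions):

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>

// Print what the process actually sees for an environment variable, so a
// mis-set or unexported variable is caught before blaming the runtime.
void report_env(const char* name)
{
    const char* val = std::getenv(name);
    if (val != nullptr)
        std::printf("%s=%s\n", name, val);
    else
        std::printf("%s is not set\n", name);
}
```

Calling report_env("KMP_BLOCKTIME") and report_env("OMP_WAIT_POLICY") as the first statements in main() shows whether the settings reached the process at all.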

>>Why do the waiting threads appear to spin continuously for the entire minute+ run? I guess if the scheduler keeps bumping the one task to different threads that could prevent any of them from sleeping but that seems like a bad design and I thought that tasks defaulted to "tied".

VTune may identify the problem. Your test program in #10 should be sufficient for Intel to use as a reproducer. (assuming the required environment variable is set correctly).

I suggest you use both KMP_BLOCKTIME and OMP_PLACES
*** If you are using OpenMP V5, set OMP_DISPLAY_ENV=TRUE
*** If not, set KMP_AFFINITY=verbose,compact

Jim Dempsey

 

Mentzer__Stuart
Beginner

Looking at it with VTune is a good idea. If I learn anything I'll post here. It would be great if Intel Support looked at this too for why GCC is faster, why the spin suppression variables don't work, and why the waiting threads seem to spin continuously.

The /Qnextgen only appeared in v.19 Update 5 as a preview. It isn't solid enough to build our real application yet but the performance gains look impressive.

Viet_H_Intel
Moderator

I have reported this issue to our developers. The internal issue is CMPLRIL0-32109.

Mentzer__Stuart
Beginner

Viet Hoang, thanks for submitting this to the developers. The improved example in comment #10 might be better for them to work with.

I did run this under VTune. Aside from the known spin wait time, the optimization of the inner for loop is clearly different. GCC is using AVX2 instructions for the loop (cumulative time in s follows each instruction):

vmovdqa ymm0, ymm1	2.02196s
inc eax			1.50136s
vpaddq ymm1, ymm1, ymm3	1.72021s
vpsrlq ymm0, ymm0, 0x1	2.32737s
vpaddq ymm2, ymm2, ymm0	1.85384s
cmp eax, 0x1dcd6500	2.59823s

but Intel C++ (with /QxHOST added to enable AVX2) is doing:

mov r10, rcx		11.3512s
inc rcx	 		7.494s
shr r10, 0x1		15.6598s
add r9, r10		17.0436s
cmp rcx, 0x77359400	20.1408s

With /Qnextgen the loop shows up in an omp_task_entry with a number of sections of AVX2 code that looks like the GCC block (loop unrolling?) but the result is 2+X faster than the mainline Intel C++.

Maybe this is useful/interesting information.

Thanks,
Stuart

P.S. VTune "copy rows to clipboard" is not working

jimdempseyatthecove
Black Belt

The two code sequences listed are not equivalent and thus not representative of what is going on.

You state " looks like the GCC block (loop unrolling?)" but then you do not show the unrolled loop.

0x77359400 / 0x1dcd6500 = 4 (2000000000 / 500000000 = 4)

So this indicates the loop iterated 4x more in the Intel code than in the GCC code.

*** However, due to lack of showing the complete loops, it is unclear if the GCC loop was unrolled or not.

Given the total runtime differences this indicates that the GCC loop WAS NOT unrolled....
... rather, the optimization was smart enough (code not shown) to do something like the following sketch

ymm2 = {0, 0, 0, 0}    ; sum = 0, but spread horizontally across the vector in 4 parts
ymm1 = {0, 1, 2, 3}    ; the four initial iteration values of k
ymm3 = {4, 4, 4, 4}    ; the per-iteration k increment (4 lanes at a time)
loop:
ymm0 = ymm1            ; get the first/next four values of k
eax++                  ; increment the loop count (scalar iteration limit / 4)
ymm0 >>= 1             ; divide the four values of k by 2
ymm2 += ymm0           ; add the four values of k/2 to the four partial sums
ymm1 += ymm3           ; advance k by 4 in each lane (vpaddq ymm1, ymm1, ymm3)
if (eax < limit) goto loop
presumably followed by a horizontal add of the resultant 4-wide sum into a 1-wide sum
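Jim's sketch above can be written as plain C++ that mimics the 4-lane strategy (illustrative only; `scalar_sum` and `lane_sum` are hypothetical names, and a real compiler keeps the lanes in ymm registers):

```cpp
#include <cassert>
#include <cstddef>

// The original scalar loop from the demo, as a function of the trip count.
std::size_t scalar_sum(std::size_t n)
{
    std::size_t s = 0;
    for (std::size_t k = 0; k < n; ++k) s += k / 2;
    return s;
}

// 4-lane version mimicking the vectorized code: four partial sums (ymm2),
// four current values of k (ymm1) advanced by 4 each pass (ymm3), with a
// horizontal add at the end. Assumes n is a multiple of 4.
std::size_t lane_sum(std::size_t n)
{
    std::size_t lane[4] = { 0, 0, 0, 0 };   // ymm2: partial sums
    std::size_t k[4]    = { 0, 1, 2, 3 };   // ymm1: current k per lane
    for (std::size_t i = 0; i < n / 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            lane[j] += k[j] / 2;            // vpsrlq + vpaddq
            k[j]    += 4;                   // vpaddq with ymm3 = {4,4,4,4}
        }
    }
    return lane[0] + lane[1] + lane[2] + lane[3]; // horizontal add
}
```

Note that the VTune listing's loop limit 0x1dcd6500 (500000000) is exactly the scalar limit 0x77359400 (2000000000) divided by 4, consistent with the loop counter counting 4-wide passes.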

**** Note, GCC is generating much more efficient code (vectorizing a scalar loop)...
**** ... however, GCC had performed a violation of the programmer's coding directive to make sum atomic.

Jim Dempsey

 

Mentzer__Stuart
Beginner

Sorry for the confusion, Jim. It was the /Qnextgen assembly, which was too long to show, that looks like a loop-unrolled version of the GCC assembly (with AVX2 instructions).

These are the assembly blocks VTune opens for the inner for loop for GCC and Intel C++; I didn't show the assembly before and after the core block in the interest of not bloating this thread. But I'm sure Intel Support can grok the assembly differences better than me anyway.

Stuart