Parallel n-body performance fluctuations

DLake1 · ‎06-26-2019

I have an experimental n-body program which when made parallel either with "#pragma omp parallel for" or "#pragma parallel" has large performance fluctuations e.g. with 32768 points the frame rate will range from just under 30 to over 50 FPS and remain mostly in the 40's.

The visual effect is everything keeps going fast and slow.

I have a 9920X, the clock speed is not a problem and its not throttling, I have a custom water cooling.

I'm compiling with AVX2, the instruction set has no effect on the issue.

The code is invoked from a DLL in a C# program.

#include "n-body.h"
#include <stdio.h>
#include <string.h>
#include <thread>
const int count = 32768;
__declspec(align(64)) float px0[count];
__declspec(align(64)) float py0[count];
__declspec(align(64)) float px1[count];
__declspec(align(64)) float py1[count];
float* pointsx0 = px0;
float* pointsy0 = py0;
float* pointsx1 = px1;
float* pointsy1 = py1;
void SetData(float* x, float* y) {
	memcpy(pointsx0, x, count*4);
	memcpy(pointsx1, x, count*4);
	memcpy(pointsy0, y, count*4);
	memcpy(pointsy1, y, count*4);
}
void Compute(float* p, float* velx, float* vely, long pcount, float aspect, float zoom) {
//#pragma omp parallel for
#pragma parallel
	for (auto i = 0; i < count; ++i) {
		auto fdx = 0.0F;
		auto fdy = 0.0F;
		for (auto j = 0; j < count; ++j) {
			if(i == j) continue;
			const auto dx = pointsx0 - pointsx0;
			const auto dy = pointsy0 - pointsy0;
			const auto f = 0.000000001F/(dx*dx + dy*dy);
			fdx += dx*f;
			fdy += dy*f;
		}
		pointsx1 = pointsx0 + (velx -= fdx);
		pointsy1 = pointsy0 + (vely -= fdy);
		if (zoom != 1) {
			p[i*2] = pointsx1*zoom/aspect;
			p[i*2 + 1] = pointsy1*zoom;
		} else {
			p[i*2] = pointsx1/aspect;
			p[i*2 + 1] = pointsy1;
		}
	}
	auto tmp = pointsx0;
	pointsx0 = pointsx1;
	pointsx1 = tmp;
	tmp = pointsy0;
	pointsy0 = pointsy1;
	pointsy1 = tmp;
}

jimdempseyatthecove · ‎06-27-2019

There are a few issues with your code:

When you choose to use "#pragma parallel" (or for that matter #pragma omp parallel), this starts a parallel region with either all threads or a specified number of threads. IOW all threads of the team start the region. Further, when the work needs to be distributed (as opposed to being duplicated) it is the programmers responsibility to partition the work according to needs. Partitioning is performed by "#pragma pfor", "#pragma omp for", and/or use of omp_get_thread_num()/omp_get_num_threads(). IOW your choice of "#pragma parallel" on line 22 should have been followed by "#pragma pfor" immediately preceding for(... Conversely, had you chosen to use "#pragma omp parallel for" you would not have required a subsequent #pragma. Or had you used "#pragma omp parallel" (sans the for), you would have required "#pragma omp for".

Your j loop is executing twice as many of times as minimally required. IOW the force between two points is calculated twice (A->B, B->A). In your particular case, the total data is about 1MB, and easily fits in the 19MB "SmartCache", and likely fits each thread's partition within its L2 cache (~77KB data partition). Combine this with the fact that the force calculation is not using sqrt and only a 2D vector, the twice calculation probably is of little consequence (and may experience a net benefit due to elimination of synchronization of thread access). *** this may not be the case with datasets that do not fit in cache and/or more compute intensife (3D with sqrt).

Seeing that the function is represented by a relatively few lines of code, I suggest you make two functions: Compute_with_zoom, and Compute_without_zoom (and eliminate thw zoom test).

The vectorization generation by the compiler will likely not like having the "if(i==j) continue;" embedded in the loop. It may be better to use two loops:

		for (auto j = 0; j < i; ++j) {
			const auto dx = pointsx0 - pointsx0;
			const auto dy = pointsy0 - pointsy0;
			const auto f = 0.000000001F/(dx*dx + dy*dy);
			fdx += dx*f;
			fdy += dy*f;
		}
		for (auto j = i+1; j < count; ++j) {
			const auto dx = pointsx0 - pointsx0;
			const auto dy = pointsy0 - pointsy0;
			const auto f = 0.000000001F/(dx*dx + dy*dy);
			fdx += dx*f;
			fdy += dy*f;
		}

Jim Dempsey

jimdempseyatthecove · ‎06-27-2019

Or, you can use the single loop with something that looks non-optimal, but in fact is more vector friendly:

auto dxdxdydy = dx*dx + dy*dy;
auto f = (dxdxdydy == 0.0F) ? 1.0F : dxdxdydy;
f = 0.000000001F/f;
if(dxdxdydy == 0.0F) f = 0.0F;
fdx += dx*f;
fdy += dy*f;

Keep in mind that the vector instructions have conditional move. When the compiler can generate conditional moves it is more efficient.

The above code avoids divide by zero and the compiler should be able to remember (reuse) the mask generated by (dxdxdydy == 0.0F).

Jim Dempsey

DLake1 · ‎06-27-2019

Thanks Jim,

"#pragma parallel" is faster than "#pragma omp parallel for" and putting "#pragma pfor" after "#pragma parallel" makes no difference.

I am aware the force is being calculated twice but I couldn't figure out how to reuse the force value without creating a parallel dependence or significantly increasing the memory demand.

Eliminating the zoom test makes no difference.

I was using a double loop at one point but I think it was slower, your single loop alternative is also slower.

jimdempseyatthecove · ‎06-28-2019

Can you post your current (best) code together with the disassembly? Use VTune if possible?

>>or significantly increasing the memory demand

In your case, with the small-ish number of points, the second pass read, and for that matter each iteration reads, will fit in L3 cache .AND. with static scheduling, you may experience a high degree of L2 hits. Until you reach larger number of points (several times that fit in L3 cache), the do twice, likely won't matter that much.

>> I couldn't figure out how to reuse the force value without creating a parallel dependence

I will see if I can post a recursive method, using OpenMP, that was derived from an older TBB example from 2009 from Robert Reed. I've reworked his code to compare: TBB, OpenMP and QuickThread parallel programming paradigms. These were all coded as array of structures. I will post pertinent pieces of this code.

I am in the process of producing a different algorithm of an n-Body sample of structure of arrays with and without use of intrinsics (SSE, AVX, AVX512) as well as with/without MPI. The end goal of the project is to produce a sample particle simulation (gravitational) in 3D complete with animation. Note, the animation in this case is not for game playing, rather the purpose is to view the progress of the simulation state during runtime. In the MPI (Cluster case), one of the nodes will include a capable graphics controller (suitable to display millions of points).

Jim Dempsey

McCalpinJohn · ‎06-28-2019

In codes with a fixed amount of work, run-to-run performance variability is most often due to changes in cache hit/miss rates. With the default 4KiB page size, the L2 cache hit rate will depend on how uniformly the physical addresses are distributed across the 16 "page colors" of the 1 MiB L2 cache. In OpenMP programs, the performance will typically be limited by the core with the highest L2 miss rate. You did not specify whether you were using 1 or 2 threads per physical core -- I would guess that this code will not need two threads per core for performance, and using two threads per core will exacerbate both cache miss problems and interference due to OS activities.

In OpenMP programs performance variability can also be caused by the OS doing stupid things with thread scheduling and/or migration.... This is a big problem in multi-NUMA-node boxes, but can be a problem with some operating systems even in a single NUMA node. All OpenMP runtime systems have support for thread pinning, which will prevent the OS from making certain classes of mistakes. For your Core i9-9920X I would start with OMP_NUM_THREADS=12 and OMP_PROC_BIND=spread (or OMP_PROC_BIND=true and OMP_PLACES=cores).

When I see significant unexpected run-to-run variability, the first thing I do is reorganize the code to make it easy to instrument each thread. (I.e., change the outer loop to be a loop from 0 to OMP_NUM_THREADS-1 and compute the indices for the inner loops explicitly for each thread.) Then I add inline timers (RDTSCP) at the beginning of the parallel loop and again at the end of the parallel loop (before the implied barrier). The TSC should be synchronized across all cores, so you can compare start times, end times, and (wall-clock) durations for each thread. For more information, I add inline performance counter reads at the same places, using the interfaces provided by https://github.com/jdmccalpin/low-overhead-timers. To use the performance counters in this way requires that:

The kernel allows user-mode execution of the RDPMC instruction (CR4.PCE=1).
The performance counters are globally enabled (IA32_PERF_GLOBAL_CTRL, MSR 0x38f = 0x70000000f).
The fixed-function performance counters are enabled (IA32_FIXED_CTR_CTRL, MSR 0x38d = 0x333).
Each OpenMP thread is bound to one specific Logical Processors (rather than the default of being allowed to bounce between the two logical processors per physical core). With GNU OpenMP, I usually unset the OMP_PROC_BIND and OMP_PLACES variables and use GOMP_CPU_AFFINITY="0,1,2,3,4,5,6,7,8,9,10,11" to force the 12 OpenMP threads to each be bound to a single, unique Logical Processor.

By measuring the TSC, CPU_CLK_UNHALTED.CORE (fixed counter 1), and CPU_CLK_UNHALTED.REF (fixed counter 2), it is straightforward to compute the core utilization (delta CPU_CLK_UNHALTED.REF / delta TSC) and the average core frequency (delta CPU_CLK_UNHALTED.CORE / delta CPU_CLK_UNHALTED.REF * nominal processor frequency). Fixed function counter 0 (INST_RETIRED.ANY) can be used to find load imbalances. (For both the TSC and INST_RETIRED.ANY calculations, the "final" counter read must be *before* the barrier at the end of the loop, so the counts are not contaminated by spin-waiting in the OpenMP barrier code.)

jimdempseyatthecove · ‎06-28-2019

Here is an example of your code reworked to compute each force calculation once. *** I have not taken the time to verify outputs.

// nBody2DSOA.cpp
//
//#include "n-body.h"	// removed
#include <stdio.h>
#include <string.h>
#include <thread>
// added
#include <iostream>

#include <omp.h>
#include <random>


const int count = 32768;
const int grainsize = 128;	// split point limit (over grainsize gets split)

__declspec(align(64)) float px0[count];
__declspec(align(64)) float py0[count];
__declspec(align(64)) float px1[count];
__declspec(align(64)) float py1[count];

__declspec(align(64)) float p[count * 2];	// added
__declspec(align(64)) float velx[count];	// added
__declspec(align(64)) float vely[count];	// added
__declspec(align(64)) float fx[count];	// added
__declspec(align(64)) float fy[count];	// added

float* pointsx0 = px0;
float* pointsy0 = py0;
float* pointsx1 = px1;
float* pointsy1 = py1;

void SetData(float* x, float* y) {
	memcpy(pointsx0, x, count * 4);
	memcpy(pointsx1, x, count * 4);
	memcpy(pointsy0, y, count * 4);
	memcpy(pointsy1, y, count * 4);
}

void Compute(float* p, float* velx, float* vely, long pcount, float aspect, float zoom) {
	//#pragma omp parallel for
#pragma parallel
	for (auto i = 0; i < count; ++i) {
		auto fdx = 0.0F;
		auto fdy = 0.0F;
		for (auto j = 0; j < count; ++j) {
			if (i == j) continue;
			const auto dx = pointsx0 - pointsx0;
			const auto dy = pointsy0 - pointsy0;
			const auto f = 0.000000001F / (dx*dx + dy*dy);
			fdx += dx*f;
			fdy += dy*f;
		}
		pointsx1 = pointsx0 + (velx -= fdx);
		pointsy1 = pointsy0 + (vely -= fdy);
		if (zoom != 1) {
			p[i * 2] = pointsx1 * zoom / aspect;
			p[i * 2 + 1] = pointsy1 * zoom;
		}
		else {
			p[i * 2] = pointsx1 / aspect;
			p[i * 2 + 1] = pointsy1;
		}
	}
	auto tmp = pointsx0;
	pointsx0 = pointsx1;
	pointsx1 = tmp;
	tmp = pointsy0;
	pointsy0 = pointsy1;
	pointsy1 = tmp;
}

void Init()
{
	std::random_device rd;  //Will be used to obtain a seed for the random number engine
	std::mt19937 gen(rd()); //Standard mersenne_twister_engine seeded with rd()
	float	posLimit = 1.0e10;
	std::uniform_real_distribution<> dis(-posLimit, posLimit);
	for (int i = 0; i < count; ++i) {
		px0 = px1 = dis(gen);
		py0 = py1 = dis(gen);
		velx = 0.0f;
		vely = 0.0f;
		fx = fy = 0.0f;
	}
}


// accumulate accelerations
void addAcc(int i, int j) {
	// Compute the force between bodies and apply to each as an acceleration
	// *** when called in parallel
	// *** it is a requirement that each thread has different i and j values
	const auto dx = pointsx0 - pointsx0;
	const auto dy = pointsy0 - pointsy0;
	const auto f = 0.000000001F / (dx*dx + dy*dy);
	fx += dx*f;
	fx -= dx*f;

	fy += dy*f;
	fy -= dy*f;
}

// interaction primitives
void interactBodies(int iFrom, int iTo) // half open range
{
	// interact within the partition
	for (int i = iFrom; i<iTo - 1; ++i)
	{
		for (int j = i + 1; j<iTo; ++j)
		{
			addAcc(i, j);
		}
	}
}

void interactBodies(int iFrom, int iTo, int jFrom, int jTo) // half open range
{
	// interact within the partition
	for (int i = iFrom; i<iTo; ++i)
	{
		for (int j = jFrom; j<jTo; ++j)
		{
			addAcc(i, j);
		}
	}
}


void rect_interactOpenMP(int i0, int i1, int j0, int j1)
{
	// if there's room, further subdivide the rectangle into four smaller ones
	int di = i1 - i0; int dj = j1 - j0;
	// and limit the amount of recursion to grainsize
	if (di > grainsize && dj > grainsize)
	{
		int im = i0 + di / 2;
		int jm = j0 + dj / 2;
#pragma omp task
		rect_interactOpenMP(i0, im, j0, jm);

#pragma omp task
		rect_interactOpenMP(im, i1, jm, j1);

		// barrier here
#pragma omp taskwait

#pragma omp task
		rect_interactOpenMP(i0, im, jm, j1);

#pragma omp task
		rect_interactOpenMP(im, i1, j0, jm);

		// barrier here
#pragma omp taskwait

	}
	else
	{
		interactBodies(i0, i1, j0, j1);
	}
} // void rect_interactOpenMP(int i0, int i1, int j0, int j1)

void body_interactOpenMP(int i, int j)
{
	// split the interaction triangle into upper and lower triangles (which can be executed
	// in parallel) and the adjacent rectangle (which will be further split)
	int d = j - i;
	if (d > 1)
	{
		int k = d / 2 + i;
		// limit the amount of recursion to grainsize
		if (d > grainsize)
		{
			// recurse using 2-way split
#pragma omp task
			body_interactOpenMP(i, k);

#pragma omp task
			body_interactOpenMP(k, j);

#pragma omp taskwait
		}
		else
		{
			interactBodies(i, j);
			return;
		}
		// now interact between the two partitions
		rect_interactOpenMP(i, k, k, j);
	}
	// if d is 1 or 0 then we can skip it.
} // void body_interactOpenMP(int i, int j, int level)


void ComputeNew(float* p, float* velx, float* vely, long pcount, float aspect, float zoom) {
	// Run the simulation over a fixed range of time steps
	// with an optimal lock-free application of parallel code
#pragma omp parallel
	{
#pragma omp master
		{
			// compute the accelerations of the bodies
			// body_interactOpenMP(from, to)
#pragma omp task
			body_interactOpenMP(0, count);

#pragma omp taskwait
		} // end omp master
		// still in parallel region
#pragma omp barrier
		// apply accelerations or force and advance the bodies
#pragma omp for schedule(static, grainsize)
		for (int i = 0; i<count; ++i)
		{
			pointsx1 = pointsx0 + (velx -= fx);
			pointsy1 = pointsy0 + (vely -= fy);
			if (zoom != 1) {
				p[i * 2] = pointsx1 * zoom / aspect;
				p[i * 2 + 1] = pointsy1 * zoom;
			}
			else {
				p[i * 2] = pointsx1 / aspect;
				p[i * 2 + 1] = pointsy1;
			}
		}
	} // end omp parallel
	auto tmp = pointsx0;
	pointsx0 = pointsx1;
	pointsx1 = tmp;
	tmp = pointsy0;
	pointsy0 = pointsy1;
	pointsy1 = tmp;
}

int main(int argc, char* argv[])
{
	long pcount = 0;
	float aspect = 1.0f;
	float zoom = 1.0f;
	Init();
	std::cout << "Compute" << std::endl;
	for (int i = 0; i < 3; ++i) {
		double t0 = omp_get_wtime();
		Compute(p, velx, vely, pcount, aspect, zoom);
		double t1 = omp_get_wtime();
		std::cout << t1 - t0 << std::endl;
	}
	std::cout << std::endl;
	std::cout << "ComputeNew" << std::endl;
	for (int i = 0; i < 3; ++i) {
		double t0 = omp_get_wtime();
		ComputeNew(p, velx, vely, pcount, aspect, zoom);
		double t1 = omp_get_wtime();
		std::cout << t1 - t0 << std::endl;
	}
	return 0;
}

Output on Core i7 2600K (4 core, 8 thread)

Compute
0.541678
0.686863
0.544323

ComputeNew
0.373752
0.164821
0.150376

Jim Dempsey

jimdempseyatthecove · ‎06-28-2019

Adding __inline to addAcc showed slight improvement

__inline void addAcc(int i, int j)

Compute
0.593494
0.56345
0.553373

ComputeNew
0.416984
0.154177
0.120703

Jim Dempsey

jimdempseyatthecove · ‎06-28-2019

This is about 4.5x improvement on my system.

I haven't tried separating the zoom / no-zoom functions, nor have I experimented with the grainsize (split point). I will leave those (and data verification) up to you.

Jim Dempsey

jimdempseyatthecove · ‎06-28-2019

Using 4x the number of points (131072)

Compute
8.48962
8.50982
8.6381

ComputeNew
1.66618
1.59275
1.53425

Over 5x improvement.

Unknown as to the number of points you intend to use.

Jim Dempsey

DLake1 · ‎06-28-2019

Your code doesn't run so good for me Jim, I get about 10 FPS with 32768 points, still get hiccups and its not accurate as the points don't form clumps and accelerate away indefinitely after collapsing from the initial randomly distributed circle.

Limiting the thread count to 12 doesn't help either and its fractionally slower.

I tried 4096 points with 12 threads and it STILL hiccups.

The frame rate just drops for up to about a second every 1-3 seconds and its very irritating.

I can build you a copy to test if you like just specify the point count and instruction set.

jimdempseyatthecove · ‎06-29-2019

How does the code above as-is perform on your system? IOW compile and run it as-is and paste the results here.

Do you have any environment variables set starting with OMP_... or KMP_...? If so, what are the values
FYI I ran my tests without anything set.

Note, if ComputeNew vs Compute (as-is) shows similar performance difference on your system...
then the problem is elsewhere on your system. IOW you are not focusing on the problem code.

One of your design problems is while your code is "double buffering" points.. the caller of your Compute is single buffering p (unless you are calling with different p's). A second problem is your Compute (and my analog) is entering and exiting a parallel region.

I suggest that you consider the following (which can be done with OpenMP or other thread model)

Add to your code two atomic<int> variables: one to hold frame number displayed, and the second to hold frame number computed.

I suggest you use something like this sketch:

frameBuilt = -1 (atomic variable)
frameDisplayed = -1 (atomic variable)
start parallel region
  start master section
     entask display updater (runs until done indicator)
     do until done indicator
       if((frameBuilt - frameDisplayed) > 1)
         pause
       else
         generateNextFrame using current region without display updater thread(sd)
         (note, swap build double buffer pointer here)
         ++frameBuilt (atomic)
       endif
     end do
  end master section
end parallel region
...

display updater task
  do until done indicator
    if(frameBuilt == frameDisplayed)
         pause
       else
         update frame
         (note, swap display double buffer pointer here)
         ++frameDisplayed (atomic)
       endif
  end do
end display updater task

IOW you have a continuously running display updater looking at a double buffer shared with a (parallel) state generator.
Further, the display updater and the state generator are using threads exclusive of each other.

If you have additional tasks to run, you will have to determine if it is best to run either exclusively or inclusive (on hardware threads) of other tasks.

Jim Dempsey

jimdempseyatthecove · ‎06-29-2019

Note, I have not used the non-OpenMP new C++ "#pragma parallel" feature. Therefore I do not know the internals of how it was implemented.

If this is built on the C# model, then this is not a thread pooling scheme.

*** also note

IIF your main is C# or C++ that

loop
   spawn Compute (from newly created thread)
   do other stuff
end loop

IOW each instance of call to Compute is made from a new thread, then for (at least the OpenMP implementation) each different thread (ID) that enters (first time) a parallel region will instantiate a new thread pool. This means you exhaust resources (memory, thread handles, etc...)

I suspect what you have not disclosed is pertinent to your issues.

Jim Dempsey

DLake1 · ‎06-29-2019

I never would have thought the C# side would be the cause as its pretty simple, theres only 1 worker thread and it calls the Compute function while the main thread renders with OpenTK and looking at the process threads the OpenMP workers are static and not constantly spawning.

private void PhysicsLoop(){
	while(true){
		if(stop){
			for(var i = 0; i < pcount; ++i) { velx = vely = 0F; }
		}
		if(reset){
			reset = false;
			var r = new Random();
			for(var i = 0; i < Startcount; ++i){
				do{
					pointsx = (float)(r.NextDouble()*2.0F - 1.0F);
					pointsy = (float)(r.NextDouble()*2.0F - 1.0F);
				} while(pointsx*pointsx + pointsy*pointsy > 1.0F);
				velx = vely = 0.0F;
			}
			NativeMethods.SetData(pointsx, pointsy);
			pcount = Startcount;
			buffersize = (IntPtr)(pcount*8);
		}
		are.WaitOne();
		NativeMethods.Compute(points0, velx, vely, pcount, aspect, zoom);
		var pointstemp = points0;
		points0 = points1;
		points1 = pointstemp;
		are1.Set();
	}
}
protected override void OnRenderFrame(FrameEventArgs e){
	GL.Clear(ClearBufferMask.ColorBufferBit);
	GL.Color4(1.0F, 1.0F, 1.0F, zoom/8);
	GL.PointSize(zoom/8);
	GL.Enable(EnableCap.PointSmooth);
	GL.Enable(EnableCap.Blend);
	GL.BlendFunc(BlendingFactorSrc.SrcAlpha, BlendingFactorDest.OneMinusSrcAlpha);
	GL.EnableClientState(ArrayCap.VertexArray);
	GL.Disable(EnableCap.DepthTest);
	GL.Disable(EnableCap.CullFace);
	GL.BindBuffer(BufferTarget.ArrayBuffer, vbo);
	are1.WaitOne();
	GL.BufferData(BufferTarget.ArrayBuffer, buffersize, points1, BufferUsageHint.StaticDraw);
	are.Set();
	GL.VertexPointer(2, VertexPointerType.Float, 0, 0);
	GL.DrawArrays(PrimitiveType.Points, 0, pcount);
	GL.BindBuffer(BufferTarget.ArrayBuffer, 0);
	GL.DisableClientState(ArrayCap.VertexArray);
	SwapBuffers();
}

I tested your code in a fresh Intel compiler 18.0 console application x64 release with "/O3", "/QaxCORE-AVX2" and "/Qopenmp" and this is the output:

Compute
1.31832
1.32542
1.28535

ComputeNew
0.0829025
0.0826959
0.0834916

However if I switch the floating point model from "/fp:precise" to "/fp:fast" I get something like this:

Compute
0.182156
0.181857
0.182324

ComputeNew
0.0211988
0.0170724
0.0179812

then if I enable "/Qparallel" I get something like this:

Compute
0.022697
0.0169897
0.0162603

ComputeNew
0.0185098
0.0174561
0.018217

jimdempseyatthecove · ‎06-29-2019

Next test is to build a C# test program with an analog to the main of the C++:

private void PhysicsLoop(){
	// insert display "Compute" here
	for(var i = 0; i < 3; ++i){
		// insert get start time here
		NativeMethods.Compute(points0, velx, vely, pcount, aspect, zoom);
		// insert get end time here
		// display end time - start time here
	}
	// insert display "ComputeNew" here
	for(var i = 0; i < 3; ++i){
		// insert get start time here
		NativeMethods.ComputeNew(points0, velx, vely, pcount, aspect, zoom);
		// insert get end time here
		// display end time - start time here
	}
}

IOW remove everything but the call to Compute and ComputeNew. See what happens.

Also, in the actual applicatoin it would be appropriate to test converting OnRenderFrame from event called to an until done loop that is mutex driven like the PhysicsLoop is (i.e. maintain same thread ID). Also, then for the OpenMP loop, reduce the number of threads by 1 (or 2).

Jim Dempsey

DLake1 · ‎06-29-2019

OnRenderFrame is called by OpenTK (OpenGL) on the UI thread.

Invoking my code takes about 19ms with consistent peaks of about 32ms.

Your code takes around 90-120ms.

DLake1 · ‎07-10-2019

I just discovered that setting the thread count to 22 with

_putenv("OMP_NUM_THREADS=22");
_putenv("OMP_PROC_BIND=spread");

while using #pragma parallel

stops the performance fluctuations and I get a stable 50 FPS with 32768 points!

But any other number of threads, more threads or less, results in the performance fluctuations.

jimdempseyatthecove · ‎07-10-2019

Good work.

Try to assure that the C# thread that launches the Compute does nothing but the Compute. And that this thread, does not complete until the end of the application. This does require that it uses a means of handshake for use of frame vector p. It might be beneficial if you use two frame vectors p (p0 and p1) to add an additional level of double buffering.

Jim Dempsey

DLake1 · ‎07-10-2019

Compute is invoked in a while(true) loop and synchronised with the OnRenderFrame method that gets called by OpenTK which is an OpenGL wrapper as you can see in post #14 so the calling thread never ends.

I can monitor the threads with "Process Hacker" in real-time and all the threads in the process are static and not constantly being created.