topic then 5% - 10% faster are a in Intel® Moderncode for Parallel Architectures

Countable loops in openMP

Livshin__David — Sun, 18 Nov 2018 10:29:05 GMT

Some OpenMP related documents state that in order for loop to be treated by OpenMP is must be “countable” providing different definitions for loop being “countable”:

the number of iterations in the loop must be countable with an integer and loop use a fixed increment.
the loop count can be “determined” ( what does it mean “determined”? )

Is it indeed the requirement of OpenMP? Or is it requirement of a specific compiler implementation of OpenMP?

Can the following code ( doesn't seems to be countable ) be parallelized by OpenMP ( note that the question is if the code can be pararallelized and not if there is a way to create a parallel equivalent of the code )

for ( i = 0; i < cnt; )
{
 x1 = 2.0 * x - 1.;
 if ( x1 < 1.0 )
 {
  i = i + 3;
  x = x*2.;
 }
 else // if ( x1 >= 1. )
 {
  i = i + 2;
  x = x/2.;
 }
}

Thank you,


David

www.dalsoft.com

You would need to make the

TimP — Sun, 18 Nov 2018 12:45:57 GMT

You would need to make the parallel for run for the maximum required count. Then you could make the body of the loop conditional on on i. With static scheduling, this would imply work imbalance, so you could work with schedule(runtime) and try various choices by environment variable such as guided, auto, or dynamic. With dynamic, at least, you should try various chunk sizes. Best choices will vary with number of cores, total number of iterations, and even which openmp library is in use.

"Countable" generally means

McCalpinJohn — Sun, 18 Nov 2018 19:28:45 GMT

"Countable" generally means that the compiler can generate code that will compute the number of loop iterations without executing the loop.

Modifying the index variable outside of the increment expression in the "for" statement is often prohibited, though special cases can be countable (e.g., a simple unconditional increment of the index variable somewhere in the loop).

In your case, the update(s) of the index variable are conditional, which is usually enough to prevent the loop from being countable. To make it worse, the condition depends on a floating-point value, and that floating-point value is updated within the loop. The number of iterations in such a case may depend on the floating-point rounding mode in effect. Determining the number of iterations in general code is equivalent to solving the Halting Problem, which is not possible. https://en.wikipedia.org/wiki/Halting_problem

"compiler can generate code

Livshin__David — Sun, 18 Nov 2018 19:57:49 GMT

"compiler can generate code that will compute the number of loop iterations without executing the loop"

what kind of code? would it be acceptable to slice the code of the original loop to extract index generation and then loop that code ( and not the original loop )?

That loop is not inherently

jimdempseyatthecove — Mon, 19 Nov 2018 13:55:00 GMT

That loop is not inherently parallelizable (except in the cases of where the initial value of x is <=0.0, and in that case replace the loop with x=x*(2**cnt)). Otherwise, x has loop order dependencies.

Jim Dempsey

Edit: <= 1.0 / (2**cnt)

jimdempseyatthecove — Mon, 19 Nov 2018 16:13:18 GMT

Edit: <= 1.0 / (2**cnt)

Jim Dempsey

Although that is not a point

Livshin__David — Mon, 19 Nov 2018 16:42:25 GMT

Although that is not a point of my post, but may be in the case you mentioned the result shall be: x*(2**(cnt/3)).

Also, when you say that loop is not parallelizabe, I guess you mean "not parallelizable by OpenMP". I wrote an autoparallelizer ( see www.dalsoft.com ) that can parallelize this loop ( in fact, much more complicated loop - the code in my post is a simplified version of that loop ).

It would help if you give a

jimdempseyatthecove — Mon, 19 Nov 2018 18:23:51 GMT

It would help if you give a link to the direct page that illustrates how the loop in #1 is auto-parallelized.

Jim Dempsey

The loop that I refered to (

Livshin__David — Mon, 19 Nov 2018 18:50:53 GMT

The loop that I refered to ( of which the code in my post is a simplified version ) is:

for ( i = 0; i < cnt; )
   {
    x1 = 2.0 * x - 1.;
    if ( x1 < 1.0 )
     {
     b = exp( x1 ) * cos( x1 );
      i = i + 3;
      x = x*2.;
     }
    else // if ( x1 >= 1. )
     {
     a = sqrt( x1 ) * log( 1 / x1 );
      i = i + 2;
      x = x/2.;
     }
   }

I will present the results on the "Compiler, Architecture, And Tools Conference", see

https://software.intel.com/en-us/event/compiler-conference/2018/schedule

After the presentation ( December 17 ) would be glad to show you the code.

Please add this to your

jimdempseyatthecove — Mon, 19 Nov 2018 21:03:00 GMT

Please add this to your calendar such that those here not attending the conference can see and comment.

While having a compiler auto-parallelize the #9 loop can be parallelized.

In the serial loop, i always increments, and thus will not produce duplicate indices for . While is has not been disclosed, it may be a requirement that the initial x be > 0.0. Therefor the values inserted into b or a would not be 0.0

This untested code may be effective:

atomic<double> fix_a[cnt], fix_b[cnt];
atomic<int> fill_a,fill_b;
...

#pragma omp parallel
{
  #pragma omp sections
  {
    for (int i = 0; i < cnt; ++i)
      a = 0.0;
    #pragma omp section
    for (int i = 0; i < cnt; ++i)
      b = 0.0;
    #pragma omp section
    for (int i = 0; i < cnt; ++i)
      fix_a = 0.0;
    #pragma omp section
    for (int i = 0; i < cnt; ++i)
      fix_b = 0.0;
  } //#pragma omp end sections
} // #pragma omp parallel

fill_a = -1;
fill_b = -1;
... x = some initial value

#pragma parallel
{
  #pragma omp master
  {
    for (int i = 0; i < cnt; )
    {
      x1 = 2.0 * x - 1.;
      if ( x1 <  1.0 )
      {
        fix_b = x1; // b = exp( x1 ) * cos( x1 );
        fill_b = i;
        i = i + 3;
        x = x*2.;
      }
      else  // if ( x1 >= 1. )
      {
        fix_a = x1; // a = sqrt( x1 ) * log( 1 / x1 );
        fill_a = i;
        i = i + 2;
        x = x/2.;
      }
    } // for (int i = 0; i < cnt; )
    fill_a = cnt;
    fill_b = cnt;
  } // #pragma omp master
  // all threads here
  int empty_a = 0;
  int empty_b = 0;
  // until done
  for(;empty_a < cnt || empty_b < cnt;)
  {
    while(empty_a <= fill_a && empty_a < cnt)
    {
      if(fix_a[empty_a] != 0.0)
      {
        double x1 = fix_a[empty_a].exchange(0.0);
        if(x1)
        {
          a[empty_a] = sqrt( x1 ) * log( 1 / x1 );
        } // if(x1)
      } // if(fix_a[empty_a])
      ++empty_a;
    }
    while(empty_b <= fill_b && empty_b < cnt)
    {
      if(fix_b[empty_b] != 0.0)
      {
        double x1 = fix_b[empty_b].exchange(0.0);
        if(x1)
        {
          b[empty_b] = exp( x1 ) * cos( x1 );
        } // if(x1)
      } // if(fix_b[empty_b])
      ++empty_b;
    }
  } // for(;empty_a < cnt || empty_b < cnt;)
} // #pragma parallel

Depending on your needs, you may want to insert _mm_pause() when waiting for work.

Keep in mind you may need to modify the code.

Also, the amount of work needs to be sufficient to amortize the overhead of starting/resuming the thread team. (IOW number of iterations is relatively large).

Jim Dempsey

It should be noted that the

jimdempseyatthecove — Tue, 20 Nov 2018 15:22:49 GMT

It should be noted that the wipe of fix_a and fix_b need only be done once due to the pickers resetting to 0.0 with exchange. As to what to do with a and b there is insufficient information in your postings.

You would want to assure that the threads performing the picking were on separate cores (IOW not with multiple threads within a core).

Would it be safe to assume the sample code was taken from some actual code, and if so, what is typical of the iteration counts, path a and path b?

Jim Dempsey

As to what to do with a and b

Livshin__David — Tue, 20 Nov 2018 16:19:06 GMT

As to what to do with a and b there is insufficient information in your postings.

a and b assumed to be parameters to the routine that contains the loop.

Would it be safe to assume the sample code was taken from some actual code, and if so, what is typical of the iteration counts, path a and path b?

No, I came up with this in an attempt to show the functionality of the auto parallelizer, specifically the ability to

calculate loop count ( of not a countable loop, thus not supported by OpenMP )
resolve memory dependency for memory writes a and b
create code to calculate values needed at the entry to a thread: x and i

If you would like, I will send you ( after the conference ) my presentation that explains the above.

As to the iteration counts: in the test suite I use, the iteration count cnt is set to be 100000000 ( 8 zeros ) - which makes me wonder how practical is your solution where you introduce new arrays of the size cnt.

Also note the use of transcendentals - this is done in order to give some weight to the loop; otherwise the overhead of doing the above will make auto-parallelization to be not worth it. I ran some tests adding more calls to "expensive" routines and saw how it improves the performance of the parallelized code. You may find the following article helpful to clarify that:

http://www.dalsoft.com/Calculating_number_of_cores_to_benefit_from_parallelization.pdf

>>which makes me wonder how

jimdempseyatthecove — Tue, 20 Nov 2018 17:08:00 GMT

>>which makes me wonder how practical is your solution where you introduce new arrays of the size cnt.

To reduce the additional arrays to 1 array, it is known that the X1's generated are all > 0.0. Therefore the sign could be used to indicate the path.

As to which is faster (your auto-gen code or my specific code), well that can be tested (by one that has both codes).

Assuming x is unknown at compile time, it is not clear to me as to how you could parallelize this. This said, one (you) could have the compiler identify this type of loop (something similar to a convergence loop), and produce a preamble for single path, then enter the flip/flop for the remainder of the convergence.

Simplified code of my prior post:

atomic<double> fix_x[cnt];
atomic<int> fill_x;
...
// once only
#pragma omp parallel
{
  #pragma omp sections
  {
    for (int i = 0; i < cnt; ++i)
      a = 0.0;
    #pragma omp section
    for (int i = 0; i < cnt; ++i)
      b = 0.0;
    #pragma omp section
    for (int i = 0; i < cnt; ++i)
      fix_x = 0.0;
  } //#pragma omp end sections
} // #pragma omp parallel

fill_x = -1;
... x = some initial value

#pragma parallel
{
  #pragma omp master
  {
    for (int i = 0; i < cnt; )
    {
      x1 = 2.0 * x - 1.;
      if ( x1 <  1.0 )
      {
        fix_x = -x1; // b = exp( x1 ) * cos( x1 );
        fill_x = i;
        i = i + 3;
        x = x*2.;
      }
      else  // if ( x1 >= 1. )
      {
        fix_x = x1; // a = sqrt( x1 ) * log( 1 / x1 );
        fill_x = i;
        i = i + 2;
        x = x/2.;
      }
    } // for (int i = 0; i < cnt; )
    fill_x = cnt;
  } // #pragma omp master
  // all threads here
  int empty_x = 0;
  // until done
  for(;empty_x < cnt;)
  {
    while(empty_x <= fill_x)
    {
      if(fix_x[empty_x] != 0.0)
      {
        double x1 = fix_x[empty_x].exchange(0.0);
        if(x1)
        {
          if(x1 > 0.0)
            a[empty_x] = sqrt( x1 ) * log( 1 / x1 );
          else
            b[empty_b] = exp( -x1 ) * cos( -x1 );
        } // if(x1)
      } // if(fix_a[empty_x])
      ++empty_x;
    }
  } // for(;empty_x < cnt;)
} // #pragma parallel

Perhaps you could compare the above with your auto-generated code.

Note, if the distribution of the modified cells is somewhat random, then this code may be better:

atomic<double> fix_x[cnt];
atomic<int> fill_x;
...
// once only
#pragma omp parallel
{
  #pragma omp sections
  {
    for (int i = 0; i < cnt; ++i)
      a = 0.0;
    #pragma omp section
    for (int i = 0; i < cnt; ++i)
      b = 0.0;
    #pragma omp section
    for (int i = 0; i < cnt; ++i)
      fix_x = 0.0;
  } //#pragma omp end sections
} // #pragma omp parallel

fill_x = -1;
... x = some initial value

#pragma parallel
{
  int iThread = omp_get_thread_num();
  int nThreads = omp_get_num_threads();
  if(iThread == 0)
  {
    for (int i = 0; i < cnt; )
    {
      x1 = 2.0 * x - 1.;
      if ( x1 <  1.0 )
      {
        fix_x = -x1; // b = exp( x1 ) * cos( x1 );
        fill_x = i;
        i = i + 3;
        x = x*2.;
      }
      else  // if ( x1 >= 1. )
      {
        fix_x = x1; // a = sqrt( x1 ) * log( 1 / x1 );
        fill_x = i;
        i = i + 2;
        x = x/2.;
      }
    } // for (int i = 0; i < cnt; )
    fill_x = cnt;
  } // if(iThread == 0)
  if(nThreads > 1)
  {
    --iThread;
    --nThreads;
  }
  if(iThread >=0)
  {
    int empty_x = 0;
    // until done
    for(;empty_x < cnt;)
    {
      while(empty_x <= fill_x)
      {
        if(empty_x % nThreads == iThread)
        {
          double x1 = fix_x[empty_x];
          if(x1 != 0.0)
          {
            fix_x[empty_x] = 0.0;
            if(x1 > 0.0)
              a[empty_x] = sqrt( x1 ) * log( 1 / x1 );
            else
              b[empty_b] = exp( -x1 ) * cos( -x1 );
          } // if(x1 != 0.0)
        } // if(empty_x % nThreads == iThread)
        ++empty_x;
      }
    } // for(;empty_x < cnt;)
  }
} // #pragma parallel

Jim Dempsey

As to which is faster (your

Livshin__David — Tue, 20 Nov 2018 17:46:50 GMT

As to which is faster (your auto-gen code or my specific code), well that can be tested (by one that has both codes).

Your code wouldn't run on my machine as a and b, with the cnt as I specified, take all the memory. The rule for writing parallel code is that everything shall be done in-place, no huge memory allocations as it may be no memory available.

Assuming x is unknown at compile time, it is not clear to me as to how you could parallelize this.

I fail to understand your preoccupation with the value of x. Autoparallelizer is not bothered by that at all. Of course user should be careful to use the values of x that don't cause the exception(s), but the same exception will occur in the sequential and parallel codes. Also, should it be clear how to parallelize sequential code, I would be out of business.

Ok, then here is simplified

jimdempseyatthecove — Tue, 20 Nov 2018 18:25:00 GMT

Ok, then here is simplified parallel loop in OpenMP

#pragma omp parallel
{
  int iThread = omp_get_thread_num();
  int nThreads = omp_get_num_threads();
  int interval = 0;
  // all threads perform
   for (int i = 0; i < cnt; ++interval)
   {
    x1 = 2.0 * x - 1.; // * is relatively fast
    if ( x1 <  1.0 )
    {
      if(iterval%nThreads == iThread)
      {
        // distribute computation part
        b = exp( x1 ) * cos( x1 ); //
      }
      i = i + 3;
      x = x*2.;
    }
    else  // if ( x1 >= 1. )
    {
      if(interval%nThreads == iThread)
      {
        // distribute computation part
        a = sqrt( x1 ) * log( 1 / x1 );
      }
      i = i + 2;
      x = x/2.;
     }
   }
}

Jim Dempsey

To the best of my knowledge,

Livshin__David — Wed, 21 Nov 2018 06:03:38 GMT

To the best of my knowledge, the loop in your last post is not canonical and therefore wouldn't be accepted by OpenMP; gcc should give compilation error ( when using -fopenmp ) requesting explicit loop increment.

Please disregard the last

Livshin__David — Wed, 21 Nov 2018 19:58:16 GMT

Please disregard the last post - I missed the fact that "#pragma omp" applied to a block ( and not to a loop ).

Sorry.

Here is the code that works,

jimdempseyatthecove — Thu, 22 Nov 2018 03:24:24 GMT

Here is the code that works, however, the compute time of the "DoWork" emulation is rather short.

// GoofyLoop.cpp
//

#include "stdafx.h"
#include <iostream>

#include <immintrin.h>
#include <math.h>
#include <omp.h>

const __int64 N = 100000000;
double* a;
double* b;
const double typicalX = 3.141592653589793;

void Serial(void)
{
	double x = typicalX;
	double x1;
	__int64 cnt = N;
	for (__int64 i = 0; i < cnt;)
	{
		x1 = 2.0 * x - 1.;
		if (x1 <  1.0)
		{
			b = exp(x1) * cos(x1);
			i = i + 3;
			x = x*2.;
		}
		else  // if ( x1 >= 1. )
		{
			a = sqrt(x1) * log(1 / x1);
			i = i + 2;
			x = x / 2.;
		}
	}
}

void Parallel(void)
{
#pragma omp parallel
	{
		int iThread = omp_get_thread_num();
		int nThreads = omp_get_num_threads();
		double x = typicalX;
		double x1;
		__int64 cnt = N;
		__int64 interval = 0;
		for (__int64 i = 0; i < cnt; ++interval)
		{
			x1 = 2.0 * x - 1.;
			if (x1 < 1.0)
			{
				if (interval%nThreads == iThread)
				{
					b = exp(x1) * cos(x1);
				}
				i = i + 3;
				x = x*2.;
			}
			else  // if ( x1 >= 1. )
			{
				if (interval%nThreads == iThread)
				{
					a = sqrt(x1) * log(1 / x1);
				}
				i = i + 2;
				x = x / 2.;
			}
		}
	}
}
int _tmain(int argc, _TCHAR* argv[])
{
	a = (double*)malloc(N * sizeof(double)); // new double;
	b = (double*)malloc(N * sizeof(double)); // new double;
#pragma omp parallel
	{
#pragma omp master
		{
			std::cout << "nThreads = " << omp_get_num_threads() << std::endl;
		}
	}
#pragma omp parallel for
	for (int i = 0; i < N; ++i)
	{
		a = 0.0;
		b = 0.0;
	}

	for (int rep = 0; rep < 3; ++rep)
	{
		unsigned __int64 t0 = _rdtsc();
		Serial();
		unsigned __int64 t1 = _rdtsc();
		std::cout << "Serial ticks = " << t1 - t0 << std::endl;
	}
	std::cout << std::endl;
	for (int rep = 0; rep < 3; ++rep)
	{
		unsigned __int64 t0 = _rdtsc();
		Parallel();
		unsigned __int64 t1 = _rdtsc();
		std::cout << "Parallel ticks = " << t1 - t0 << std::endl;
	}
	return 0;
}

On a 4 core w/ HT, Core i7 2700K, running 1 thread per core:

Threads = 4
Serial ticks = 3076054566
Serial ticks = 2543030436
Serial ticks = 2547671985

Parallel ticks = 2116263348
Parallel ticks = 2116889788
Parallel ticks = 2128250491

Marginal.

3 Threads:

hreads = 3
Serial ticks = 2585388714
Serial ticks = 2603521131
Serial ticks = 2602569400

Parallel ticks = 2098115233
Parallel ticks = 2108614224
Parallel ticks = 2098695600

Slightly better.

The Do Work section is relatively small computation between memory writes, therefore it seems that for this example, the degree of (productive) parallelization is dependent upon the memory subsystem.

Jim Dempsey

I created the following test

Livshin__David — Thu, 22 Nov 2018 13:01:39 GMT

I created the following test program:

#include <stdio.h>
#include <math.h>

#define N 100000000

double a, b;

double foo( double *a, double *b, double x, unsigned int cnt )
{
double x1;
unsigned int i;

asm( "#.dco_start" );

for ( i = 0; i < cnt; i++ )
   {
    x1 = 2.0 * x - 1;
    if ( x1 < 1.0 )
     {
      x1 = exp( x1 ) * cos( x1 );
      b = x1;
      i = i + 3;
      x = x*2.;
     }
    else // if ( x1 >= 1. )
     {
      x1 = sqrt( x1 ) * log( 1. / x1 );
      a = x1;
      i = i + 2;
      x = x/2.;
     }
   }

asm( "#.dco_end" );

return x;

}

// by Jim Dempsey
double foo_Jim( double *a, double *b, double x, unsigned int cnt )
{
double x1;
unsigned int i;

#if 0

#pragma omp parallel
{
int iThread = omp_get_thread_num();
int nThreads = omp_get_num_threads();
int interval = 0;
// all threads perform
   for ( i = 0; i < cnt; ++interval)
   {
    x1 = 2.0 * x - 1.; // * is relatively fast
    if ( x1 < 1.0 )
    {
      if(interval%nThreads == iThread)
      {
        // distribute computation part
        b = exp( x1 ) * cos( x1 ); //
      }
      i = i + 3;
      x = x*2.;
    }
    else // if ( x1 >= 1. )
    {
      if(interval%nThreads == iThread)
      {
        // distribute computation part
        a = sqrt( x1 ) * log( 1 / x1 );
      }
      i = i + 2;
      x = x/2.;
     }
   }
}

#endif

return x;

}

int main()
{
double rslt, rslt1;
unsigned int i;

for ( i = 0; i < N; i++ )
   {
    a = 0.;
    b = 0.;
   }

for( i = 0; i < 10; i++ )
   {
    rslt = foo( a, b, 3.1415, N - 100 );
//    rslt = foo_Jim( a, b, 3.1415, N - 100 );
   }

rslt1 = 0.;
for ( i = 0; i < N; i++ )
   {
    rslt1 += a;
    rslt1 += b;
   }

printf( "rslt %f %f\n", rslt, rslt1 );

}

and generated 3 executables ( altering code as necessary ):

serial code: loop

parallel code generated by my auto-parallelizer: loop_dco

your code: loop_Jim

Each program was executed 3 times on the E8600 @ 3.33GHz, 2 cores Linux machine. The execution was under "time" command ( e.g. "time ./loop" ) and reported time ( see bellow ) is 'real' time produced by the "time" command that is neither the fastest nor the slowest out of 3 executions attepmted.

The execution times ( in seconds ) are:

loop: 14.99

loop_dco: 11.75

loop_Jim: 10.48

As you can see the program generates and prints checksums - for loop and loop_dco these chechsums always ( for every invocation ) fully agreed, loop_Jim was generating different checksums for every separate run (?).

Few words about the code:

loop_Jim assumes that a and b are not-overlaping memory regions and therefore may be use in parallel code; loop_dco doesnt make such an assumption and generates code to verify that dynamicaly at run time - overhead of up to 20%.

loop_jim doesnt preserve the value of x that shall be returned.

If you like, I would be glad to send you my ( Linux ) executables.

Knowing this is now memory

jimdempseyatthecove — Thu, 22 Nov 2018 15:12:50 GMT

Knowing this is now memory access bound, performing aligned allocation and organizing stores on cache lines, yields a little more improvement:

const int CacheLineSize = 64;
const int doublesInCacheLine = CacheLineSize / sizeof(double);
...
	a = (double*)_mm_malloc(N * sizeof(double), 64); // (double*)malloc(N * sizeof(double)); // new double;
	b = (double*)_mm_malloc(N * sizeof(double), 64); // (double*)malloc(N * sizeof(double)); // new double;
...
				if ((i / doublesInCacheLine)%nThreads == iThread)
...
				if ((i / doublesInCacheLine) % nThreads == iThread)

nThreads = 4
Serial ticks = 2665960090
Serial ticks = 2645705503
Serial ticks = 2551374815

Parallel ticks = 1949592124
Parallel ticks = 1957116967
Parallel ticks = 2024346334

Jim Dempsey