Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- Intel Community
- Software
- Software Development Topics
- Intel® Moderncode for Parallel Architectures
- Countable loops in openMP

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

Livshin__David

Beginner

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

11-18-2018
02:29 AM

766 Views

Countable loops in openMP

Some OpenMP related documents state that in order for loop to be treated by OpenMP is must be “countable” providing different definitions for loop being “countable”:

- the number of iterations in the loop must be countable with an integer and loop use a fixed increment.
- the loop count can be “determined” ( what does it mean “determined”? )

Is it indeed the requirement of OpenMP? Or is it requirement of a specific compiler implementation of OpenMP?

Can the following code ( doesn't seems to be countable ) be parallelized by OpenMP ( note that the question is if the code can be pararallelized and not if there is a way to create a parallel equivalent of the code )

for ( i = 0; i < cnt; ) { x1 = 2.0 * x - 1.; if ( x1 < 1.0 ) { i = i + 3; x = x*2.; } else // if ( x1 >= 1. ) { i = i + 2; x = x/2.; } } Thank you, David

Link Copied

41 Replies

TimP

Black Belt

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

11-18-2018
04:45 AM

491 Views

McCalpinJohn

Black Belt

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

11-18-2018
11:28 AM

491 Views

"Countable" generally means that the compiler can generate code that will compute the number of loop iterations without executing the loop.

Modifying the index variable outside of the increment expression in the "for" statement is often prohibited, though special cases can be countable (e.g., a simple unconditional increment of the index variable somewhere in the loop).

In your case, the update(s) of the index variable are conditional, which is usually enough to prevent the loop from being countable. To make it worse, the condition depends on a floating-point value, and that floating-point value is updated within the loop. The number of iterations in such a case may depend on the floating-point rounding mode in effect. Determining the number of iterations in general code is equivalent to solving the Halting Problem, which is not possible. https://en.wikipedia.org/wiki/Halting_problem

Livshin__David

Beginner

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

11-18-2018
11:57 AM

491 Views

"compiler can generate code that will compute the number of loop iterations without executing the loop"

what kind of code? would it be acceptable to slice the code of the original loop to extract index generation and then loop that code ( and not the original loop )?

jimdempseyatthecove

Black Belt

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

11-19-2018
05:55 AM

491 Views

That loop is not inherently parallelizable (except in the cases of where the initial value of x is <=0.0, and in that case replace the loop with x=x*(2**cnt)). Otherwise, x has loop order dependencies.

Jim Dempsey

jimdempseyatthecove

Black Belt

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

11-19-2018
08:13 AM

490 Views

Edit: <= 1.0 / (2**cnt)

Jim Dempsey

Livshin__David

Beginner

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

11-19-2018
08:42 AM

490 Views

Although that is not a point of my post, but may be in the case you mentioned the result shall be: x*(2**(cnt/3)).

Also, when you say that loop is not parallelizabe, I guess you mean "not parallelizable by OpenMP". I wrote an autoparallelizer ( see www.dalsoft.com ) that can parallelize this loop ( in fact, much more complicated loop - the code in my post is a simplified version of that loop ).

jimdempseyatthecove

Black Belt

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

11-19-2018
10:23 AM

490 Views

It would help if you give a link to the direct page that illustrates how the loop in #1 is auto-parallelized.

Jim Dempsey

Livshin__David

Beginner

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

11-19-2018
10:50 AM

490 Views

The loop that I refered to ( of which the code in my post is a simplified version ) is:

for ( i = 0; i < cnt; )

{

x1 = 2.0 * x - 1.;

if ( x1 < 1.0 )

{

b* = exp( x1 ) * cos( x1 ); i = i + 3; x = x*2.; } else // if ( x1 >= 1. ) { a*

I will present the results on the "Compiler, Architecture, And Tools Conference", see

https://software.intel.com/en-us/event/compiler-conference/2018/schedule

After the presentation ( December 17 ) would be glad to show you the code.

jimdempseyatthecove

Black Belt

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

11-19-2018
01:03 PM

490 Views

Please add this to your calendar such that those here not attending the conference can see and comment.

While having a compiler auto-parallelize the #9 loop can be parallelized.

In the serial loop, i always increments, and thus will not produce duplicate indices for *. While is has not been disclosed, it may be a requirement that the initial x be > 0.0. Therefor the values inserted into b or a would not be 0.0*

This untested code may be effective:

atomic<double> fix_a[cnt], fix_b[cnt]; atomic<int> fill_a,fill_b; ... #pragma omp parallel { #pragma omp sections { for (int i = 0; i < cnt; ++i) a= 0.0; #pragma omp section for (int i = 0; i < cnt; ++i) b= 0.0; #pragma omp section for (int i = 0; i < cnt; ++i) fix_a= 0.0; #pragma omp section for (int i = 0; i < cnt; ++i) fix_b= 0.0; } //#pragma omp end sections } // #pragma omp parallel fill_a = -1; fill_b = -1; ... x = some initial value #pragma parallel { #pragma omp master { for (int i = 0; i < cnt; ) { x1 = 2.0 * x - 1.; if ( x1 < 1.0 ) { fix_b= x1; // b= exp( x1 ) * cos( x1 ); fill_b = i; i = i + 3; x = x*2.; } else // if ( x1 >= 1. ) { fix_a= x1; // a= sqrt( x1 ) * log( 1 / x1 ); fill_a = i; i = i + 2; x = x/2.; } } // for (int i = 0; i < cnt; ) fill_a = cnt; fill_b = cnt; } // #pragma omp master // all threads here int empty_a = 0; int empty_b = 0; // until done for(;empty_a < cnt || empty_b < cnt;) { while(empty_a <= fill_a && empty_a < cnt) { if(fix_a[empty_a] != 0.0) { double x1 = fix_a[empty_a].exchange(0.0); if(x1) { a[empty_a] = sqrt( x1 ) * log( 1 / x1 ); } // if(x1) } // if(fix_a[empty_a]) ++empty_a; } while(empty_b <= fill_b && empty_b < cnt) { if(fix_b[empty_b] != 0.0) { double x1 = fix_b[empty_b].exchange(0.0); if(x1) { b[empty_b] = exp( x1 ) * cos( x1 ); } // if(x1) } // if(fix_b[empty_b]) ++empty_b; } } // for(;empty_a < cnt || empty_b < cnt;) } // #pragma parallel

Depending on your needs, you may want to insert _mm_pause() when waiting for work.

Keep in mind you may need to modify the code.

Also, the amount of work needs to be sufficient to amortize the overhead of starting/resuming the thread team. (IOW number of iterations is relatively large).

Jim Dempsey

jimdempseyatthecove

Black Belt

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

11-20-2018
07:22 AM

490 Views

It should be noted that the wipe of fix_a and fix_b need only be done once due to the pickers resetting to 0.0 with exchange. As to what to do with a and b there is insufficient information in your postings.

You would want to assure that the threads performing the picking were on separate cores (IOW not with multiple threads within a core).

Would it be safe to assume the sample code was taken from some actual code, and if so, what is typical of the iteration counts, path a and path b?

Jim Dempsey

Livshin__David

Beginner

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

11-20-2018
08:19 AM

490 Views

As to what to do with a and b there is insufficient information in your postings.

a and b assumed to be parameters to the routine that contains the loop.

Would it be safe to assume the sample code was taken from some actual code, and if so, what is typical of the iteration counts, path a and path b?

No, I came up with this in an attempt to show the functionality of the auto parallelizer, specifically the ability to

- calculate loop count ( of not a countable loop, thus not supported by OpenMP )
- resolve memory dependency for memory writes a and b
- create code to calculate values needed at the entry to a thread: x and i

If you would like, I will send you ( after the conference ) my presentation that explains the above.

As to the iteration counts: in the test suite I use, the iteration count cnt is set to be 100000000 ( 8 zeros ) - which makes me wonder how practical is your solution where you introduce new arrays of the size cnt.

Also note the use of transcendentals - this is done in order to give some weight to the loop; otherwise the overhead of doing the above will make auto-parallelization to be not worth it. I ran some tests adding more calls to "expensive" routines and saw how it improves the performance of the parallelized code. You may find the following article helpful to clarify that:

http://www.dalsoft.com/Calculating_number_of_cores_to_benefit_from_parallelization.pdf

jimdempseyatthecove

Black Belt

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

11-20-2018
09:08 AM

490 Views

>>which makes me wonder how practical is your solution where you introduce new arrays of the size cnt.

To reduce the additional arrays to 1 array, it is known that the X1's generated are all > 0.0. Therefore the sign could be used to indicate the path.

As to which is faster (your auto-gen code or my specific code), well that can be tested (by one that has both codes).

Assuming x is unknown at compile time, it is not clear to me as to how you could parallelize this. This said, one (you) could have the compiler identify this type of loop (something similar to a convergence loop), and produce a preamble for single path, then enter the flip/flop for the remainder of the convergence.

Simplified code of my prior post:

atomic<double> fix_x[cnt]; atomic<int> fill_x; ... // once only #pragma omp parallel { #pragma omp sections { for (int i = 0; i < cnt; ++i) a= 0.0; #pragma omp section for (int i = 0; i < cnt; ++i) b= 0.0; #pragma omp section for (int i = 0; i < cnt; ++i) fix_x= 0.0; } //#pragma omp end sections } // #pragma omp parallel fill_x = -1; ... x = some initial value #pragma parallel { #pragma omp master { for (int i = 0; i < cnt; ) { x1 = 2.0 * x - 1.; if ( x1 < 1.0 ) { fix_x= -x1; // b= exp( x1 ) * cos( x1 ); fill_x = i; i = i + 3; x = x*2.; } else // if ( x1 >= 1. ) { fix_x= x1; // a= sqrt( x1 ) * log( 1 / x1 ); fill_x = i; i = i + 2; x = x/2.; } } // for (int i = 0; i < cnt; ) fill_x = cnt; } // #pragma omp master // all threads here int empty_x = 0; // until done for(;empty_x < cnt;) { while(empty_x <= fill_x) { if(fix_x[empty_x] != 0.0) { double x1 = fix_x[empty_x].exchange(0.0); if(x1) { if(x1 > 0.0) a[empty_x] = sqrt( x1 ) * log( 1 / x1 ); else b[empty_b] = exp( -x1 ) * cos( -x1 ); } // if(x1) } // if(fix_a[empty_x]) ++empty_x; } } // for(;empty_x < cnt;) } // #pragma parallel

Perhaps you could compare the above with your auto-generated code.

Note, if the distribution of the modified cells is somewhat random, then this code may be better:

atomic<double> fix_x[cnt]; atomic<int> fill_x; ... // once only #pragma omp parallel { #pragma omp sections { for (int i = 0; i < cnt; ++i) a= 0.0; #pragma omp section for (int i = 0; i < cnt; ++i) b= 0.0; #pragma omp section for (int i = 0; i < cnt; ++i) fix_x= 0.0; } //#pragma omp end sections } // #pragma omp parallel fill_x = -1; ... x = some initial value #pragma parallel { int iThread = omp_get_thread_num(); int nThreads = omp_get_num_threads(); if(iThread == 0) { for (int i = 0; i < cnt; ) { x1 = 2.0 * x - 1.; if ( x1 < 1.0 ) { fix_x= -x1; // b= exp( x1 ) * cos( x1 ); fill_x = i; i = i + 3; x = x*2.; } else // if ( x1 >= 1. ) { fix_x= x1; // a= sqrt( x1 ) * log( 1 / x1 ); fill_x = i; i = i + 2; x = x/2.; } } // for (int i = 0; i < cnt; ) fill_x = cnt; } // if(iThread == 0) if(nThreads > 1) { --iThread; --nThreads; } if(iThread >=0) { int empty_x = 0; // until done for(;empty_x < cnt;) { while(empty_x <= fill_x) { if(empty_x % nThreads == iThread) { double x1 = fix_x[empty_x]; if(x1 != 0.0) { fix_x[empty_x] = 0.0; if(x1 > 0.0) a[empty_x] = sqrt( x1 ) * log( 1 / x1 ); else b[empty_b] = exp( -x1 ) * cos( -x1 ); } // if(x1 != 0.0) } // if(empty_x % nThreads == iThread) ++empty_x; } } // for(;empty_x < cnt;) } } // #pragma parallel

Jim Dempsey

Livshin__David

Beginner

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

11-20-2018
09:46 AM

490 Views

As to which is faster (your auto-gen code or my specific code), well that can be tested (by one that has both codes).

Your code wouldn't run on my machine as a and b, with the cnt as I specified, take all the memory. The rule for writing parallel code is that everything shall be done in-place, no huge memory allocations as it may be no memory available.

Assuming x is unknown at compile time, it is not clear to me as to how you could parallelize this.

I fail to understand your preoccupation with the value of x. Autoparallelizer is not bothered by that at all. Of course user should be careful to use the values of x that don't cause the exception(s), but the same exception will occur in the sequential and parallel codes. Also, should it be clear how to parallelize sequential code, I would be out of business.

jimdempseyatthecove

Black Belt

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

11-20-2018
10:25 AM

490 Views

Ok, then here is simplified parallel loop in OpenMP

#pragma omp parallel { int iThread = omp_get_thread_num(); int nThreads = omp_get_num_threads(); int interval = 0; // all threads perform for (int i = 0; i < cnt; ++interval) { x1 = 2.0 * x - 1.; // * is relatively fast if ( x1 < 1.0 ) { if(iterval%nThreads == iThread) { // distribute computation part b= exp( x1 ) * cos( x1 ); // } i = i + 3; x = x*2.; } else // if ( x1 >= 1. ) { if(interval%nThreads == iThread) { // distribute computation part a= sqrt( x1 ) * log( 1 / x1 ); } i = i + 2; x = x/2.; } } }

Jim Dempsey

Livshin__David

Beginner

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

11-20-2018
10:03 PM

490 Views

Livshin__David

Beginner

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

11-21-2018
11:58 AM

490 Views

Please disregard the last post - I missed the fact that "#pragma omp" applied to a block ( and not to a loop ).

Sorry.

jimdempseyatthecove

Black Belt

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

11-21-2018
07:24 PM

490 Views

Here is the code that works, however, the compute time of the "DoWork" emulation is rather short.

// GoofyLoop.cpp // #include "stdafx.h" #include <iostream> #include <immintrin.h> #include <math.h> #include <omp.h> const __int64 N = 100000000; double* a; double* b; const double typicalX = 3.141592653589793; void Serial(void) { double x = typicalX; double x1; __int64 cnt = N; for (__int64 i = 0; i < cnt;) { x1 = 2.0 * x - 1.; if (x1 < 1.0) { b= exp(x1) * cos(x1); i = i + 3; x = x*2.; } else // if ( x1 >= 1. ) { a= sqrt(x1) * log(1 / x1); i = i + 2; x = x / 2.; } } } void Parallel(void) { #pragma omp parallel { int iThread = omp_get_thread_num(); int nThreads = omp_get_num_threads(); double x = typicalX; double x1; __int64 cnt = N; __int64 interval = 0; for (__int64 i = 0; i < cnt; ++interval) { x1 = 2.0 * x - 1.; if (x1 < 1.0) { if (interval%nThreads == iThread) { b= exp(x1) * cos(x1); } i = i + 3; x = x*2.; } else // if ( x1 >= 1. ) { if (interval%nThreads == iThread) { a= sqrt(x1) * log(1 / x1); } i = i + 2; x = x / 2.; } } } } int _tmain(int argc, _TCHAR* argv[]) { a = (double*)malloc(N * sizeof(double)); // new double; b = (double*)malloc(N * sizeof(double)); // new double ; #pragma omp parallel { #pragma omp master { std::cout << "nThreads = " << omp_get_num_threads() << std::endl; } } #pragma omp parallel for for (int i = 0; i < N; ++i) { a = 0.0; b= 0.0; } for (int rep = 0; rep < 3; ++rep) { unsigned __int64 t0 = _rdtsc(); Serial(); unsigned __int64 t1 = _rdtsc(); std::cout << "Serial ticks = " << t1 - t0 << std::endl; } std::cout << std::endl; for (int rep = 0; rep < 3; ++rep) { unsigned __int64 t0 = _rdtsc(); Parallel(); unsigned __int64 t1 = _rdtsc(); std::cout << "Parallel ticks = " << t1 - t0 << std::endl; } return 0; }

On a 4 core w/ HT, Core i7 2700K, running 1 thread per core:

Threads = 4 Serial ticks = 3076054566 Serial ticks = 2543030436 Serial ticks = 2547671985 Parallel ticks = 2116263348 Parallel ticks = 2116889788 Parallel ticks = 2128250491

Marginal.

3 Threads:

hreads = 3 Serial ticks = 2585388714 Serial ticks = 2603521131 Serial ticks = 2602569400 Parallel ticks = 2098115233 Parallel ticks = 2108614224 Parallel ticks = 2098695600

Slightly better.

The Do Work section is relatively small computation between memory writes, therefore it seems that for this example, the degree of (productive) parallelization is dependent upon the memory subsystem.

Jim Dempsey

Livshin__David

Beginner

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

11-22-2018
05:01 AM

490 Views

I created the following test program:

#include <stdio.h>

#include <math.h>

#define N 100000000

double a

double foo( double *a, double *b, double x, unsigned int cnt )

{

double x1;

unsigned int i;

asm( "#.dco_start" );

for ( i = 0; i < cnt; i++ )

{

x1 = 2.0 * x - 1;

if ( x1 < 1.0 )

{

x1 = exp( x1 ) * cos( x1 );

b* = x1; i = i + 3; x = x*2.; } else // if ( x1 >= 1. ) { x1 = sqrt( x1 ) * log( 1. / x1 ); a*

asm( "#.dco_end" );

return x;

}

// by Jim Dempsey

double foo_Jim( double *a, double *b, double x, unsigned int cnt )

{

double x1;

unsigned int i;

#if 0

#pragma omp parallel

{

int iThread = omp_get_thread_num();

int nThreads = omp_get_num_threads();

int interval = 0;

// all threads perform

for ( i = 0; i < cnt; ++interval)

{

x1 = 2.0 * x - 1.; // * is relatively fast

if ( x1 < 1.0 )

{

if(interval%nThreads == iThread)

{

// distribute computation part

b* = exp( x1 ) * cos( x1 ); // } i = i + 3; x = x*2.; } else // if ( x1 >= 1. ) { if(interval%nThreads == iThread) { // distribute computation part a*

#endif

return x;

}

int main()

{

double rslt, rslt1;

unsigned int i;

for ( i = 0; i < N; i++ )

{

a* = 0.; b = 0.; }*

for( i = 0; i < 10; i++ )

{

rslt = foo( a, b, 3.1415, N - 100 );

// rslt = foo_Jim( a, b, 3.1415, N - 100 );

}

rslt1 = 0.;

for ( i = 0; i < N; i++ )

{

rslt1 += a*; rslt1 += b ; }*

printf( "rslt %f %f\n", rslt, rslt1 );

}

and generated 3 executables ( altering code as necessary ):

serial code: loop

parallel code generated by my auto-parallelizer: loop_dco

your code: loop_Jim

Each program was executed 3 times on the E8600 @ 3.33GHz, 2 cores Linux machine. The execution was under "time" command ( e.g. "time ./loop" ) and reported time ( see bellow ) is 'real' time produced by the "time" command that is neither the fastest nor the slowest out of 3 executions attepmted.

The execution times ( in seconds ) are:

loop: 14.99

loop_dco: 11.75

loop_Jim: 10.48

As you can see the program generates and prints checksums - for loop and loop_dco these chechsums always ( for every invocation ) fully agreed, loop_Jim was generating different checksums for every separate run (?).

Few words about the code:

loop_Jim assumes that a and b are not-overlaping memory regions and therefore may be use in parallel code; loop_dco doesnt make such an assumption and generates code to verify that dynamicaly at run time - overhead of up to 20%.

loop_jim doesnt preserve the value of x that shall be returned.

If you like, I would be glad to send you my ( Linux ) executables.

jimdempseyatthecove

Black Belt

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

11-22-2018
07:12 AM

490 Views

Knowing this is now memory access bound, performing aligned allocation and organizing stores on cache lines, yields a little more improvement:

const int CacheLineSize = 64; const int doublesInCacheLine = CacheLineSize / sizeof(double); ... a = (double*)_mm_malloc(N * sizeof(double), 64); // (double*)malloc(N * sizeof(double)); // new double; b = (double*)_mm_malloc(N * sizeof(double), 64); // (double*)malloc(N * sizeof(double)); // new double ; ... if ((i / doublesInCacheLine)%nThreads == iThread) ... if ((i / doublesInCacheLine) % nThreads == iThread)

nThreads = 4 Serial ticks = 2665960090 Serial ticks = 2645705503 Serial ticks = 2551374815 Parallel ticks = 1949592124 Parallel ticks = 1957116967 Parallel ticks = 2024346334

Jim Dempsey

jimdempseyatthecove

Black Belt

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

11-22-2018
07:32 AM

265 Views

I get consistent results. time command line includes program load time, page file initialization time, etc... and is not too useful for loop analysis.

// GoofyLoop.cpp : Defines the entry point for the console application. // #include "stdafx.h" #include <iostream> #include <immintrin.h> #include <math.h> #include <omp.h> const __int64 N = 100000000; double* a; double* b; const double typicalX = 3.141592653589793; const int CacheLineSize = 64; const int doublesInCacheLine = CacheLineSize / sizeof(double); double Serial(void) { double x = typicalX; double x1; __int64 cnt = N; for (__int64 i = 0; i < cnt;) { x1 = 2.0 * x - 1.; if (x1 < 1.0) { b= exp(x1) * cos(x1); i = i + 3; x = x*2.; } else // if ( x1 >= 1. ) { a= sqrt(x1) * log(1 / x1); i = i + 2; x = x / 2.; } } return x; } double Parallel(void) { double rslt; #pragma omp parallel { int iThread = omp_get_thread_num(); int nThreads = omp_get_num_threads(); double x = typicalX; double x1; __int64 cnt = N; for (__int64 i = 0; i < cnt;) { x1 = 2.0 * x - 1.; if (x1 < 1.0) { if ((i / doublesInCacheLine)%nThreads == iThread) { b= exp(x1) * cos(x1); } i = i + 3; x = x*2.; } else // if ( x1 >= 1. ) { if ((i / doublesInCacheLine) % nThreads == iThread) { a= sqrt(x1) * log(1 / x1); } i = i + 2; x = x / 2.; } } rslt = x; } return rslt; } void Wipe() { #pragma omp parallel for for (__int64 i = 0; i < N; ++i) { a= 0.0; b= 0.0; } } double Tally() { double tally = 0.0; for (__int64 i = 0; i < N; ++i) { tally += (a*i) + (b*i); } return tally; } int _tmain(int argc, _TCHAR* argv[]) { a = (double*)_mm_malloc(N * sizeof(double), 64); // (double*)malloc(N * sizeof(double)); // new double; b = (double*)_mm_malloc(N * sizeof(double), 64); // (double*)malloc(N * sizeof(double)); // new double ; #pragma omp parallel { #pragma omp master { std::cout << "nThreads = " << omp_get_num_threads() << std::endl; } } for (int rep = 0; rep < 3; ++rep) { Wipe(); unsigned __int64 t0 = _rdtsc(); double rslt = Serial(); unsigned __int64 t1 = _rdtsc(); double tally = Tally(); std::cout << "Serial rslt = " << rslt << " tally = " << tally << " ticks = " << t1 - t0 << std::endl; } std::cout << std::endl; for (int rep = 0; rep < 3; ++rep) { Wipe(); unsigned __int64 t0 = _rdtsc(); double rslt = Parallel(); unsigned __int64 t1 = _rdtsc(); double tally = Tally(); std::cout << "Parallel rslt = " << rslt << " tally = " << tally << " ticks = " << t1 - t0 << std::endl; } return 0; }

nThreads = 4

Serial rslt = 1.5708 tally = 3.74665e+014 ticks = 2737953921

Serial rslt = 1.5708 tally = 3.74665e+014 ticks = 2629847324

Serial rslt = 1.5708 tally = 3.74665e+014 ticks = 2616123552

Parallel rslt = 1.5708 tally = 3.74665e+014 ticks = 2057548183

Parallel rslt = 1.5708 tally = 3.74665e+014 ticks = 2159630511

Parallel rslt = 1.5708 tally = 3.74665e+014 ticks = 2201760969

Jim Dempsey

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

For more complete information about compiler optimizations, see our Optimization Notice.