Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

OpenMP problem with min/max

levicki
Valued Contributor I

I have a problem with this code:


#include <limits.h>

void test(short *a, int n, short *Min, short *Max)
{
	short	sMin = SHRT_MAX;
	short	sMax = SHRT_MIN;

	for (int i = 0; i < n; i++) {
		sMin = __min(a[i], sMin);
		sMax = __max(a[i], sMax);
	}

	*Min = sMin;
	*Max = sMax;
}

Loop gets vectorized and I would like to keep it that way.

What would be the proper way to thread this using OpenMP? Is there any other way to do it more efficiently?

TimP
Honored Contributor III
In Fortran, there are standard OpenMP reductions for this. For C, there are plenty of examples available.  Here is a related C++ fragment, for a 2-d array, where the maximum for each 1-d extent is calculated, and a critical region used to determine whether that value should update the global.
#pragma omp parallel for if(i__2 > 103)
for (j = 1; j <= i__2; ++j)
{
	float *maxj = max_element(&aa[1 + j*aa_dim1], &aa[i__3 + 1 + j*aa_dim1]);
#pragma omp critical
	if (*maxj > max__ || (*maxj == max__ && j < yindex)) {
		max__ = *maxj;
		xindex = maxj - &aa[1 + j*aa_dim1] + 1;
		yindex = j;
	}
}
As indicated, there should be a large number of vectorizable segments per thread before threading becomes a useful addition.
jimdempseyatthecove
Honored Contributor III

Igor,

How large is n?

There is a fair amount of overhead in starting up a thread. The loop you have will likely saturate the memory bus.

If you intend to run the MinMax inside a region that is already parallelized, and if n is moderate in size, then consider using two parallel sections, one to collect the min and one to collect the max. Note though that this will fetch the data twice and may be slower than what you have above. If the two threads share a common cache then you might see a benefit.
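
In rough C, the two-sections idea might look like this (a sketch only, with names of my choosing; each section makes its own pass over the data):

#include <limits.h>

void minmax_sections(const short *a, int n, short *Min, short *Max)
{
	short sMin = SHRT_MAX;
	short sMax = SHRT_MIN;

	#pragma omp parallel sections
	{
		#pragma omp section
		{
			short m = SHRT_MAX;
			for (int i = 0; i < n; i++)	// first fetch of the data
				if (a[i] < m) m = a[i];
			sMin = m;	// each section writes a distinct variable, no race
		}
		#pragma omp section
		{
			short m = SHRT_MIN;
			for (int i = 0; i < n; i++)	// second fetch of the same data
				if (a[i] > m) m = a[i];
			sMax = m;
		}
	}

	*Min = sMin;
	*Max = sMax;
}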

And, although your loop may be vectorized, is it optimal?

Is it worth your time to write SSE3 assembler code to produce the fastest code?

Jim Dempsey

levicki
Valued Contributor I

Tim, thanks, I'll take a look at those samples if you tell me where to find them.

Jim, n is up to 640,000 elements at the moment but in the future it can only grow, not shrink. I am aware that there may be overhead but since I am investigating all possibilities I wanted to try this one as well.

As for vectorization, the code that the 11.0 beta generates is pretty decent; I believe there would be no benefit in writing asm code because that would be harder to thread.

This code is at function top level so it is not inside a parallel section. Actually there was more code, but I separated it out because it wasn't vectorizable (check my latest thread in the compiler section if you are interested).

I don't understand why there is no reduction for < and > in OpenMP? Or perhaps there is?

jimdempseyatthecove
Honored Contributor III

Igor,

Is the compiler generating code using PMAXSW and PMINSW (and two PSHUFW's)?

>>no benefit in writing asm code because that would be harder to thread

You call the ASM function within the parallel region. No difference in coding.

Do the preamble and postamble alignment in C++.

If you had 4 cores then each could work on 160,000 elements. Store the 8 results (a min and a max per thread) into an array. Then outside the parallel region perform 4 min and 4 max operations (you could vectorize that, but why bother).

If the operation that filled the array did so recently, then with affinity set you may get good locality of data (each core's ~320 KB chunk may already be in cache).
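
A rough sketch of that scheme in C (the chunking, the names, and the tail-handling to-do are illustrative, not tested code):

#include <limits.h>
#include <omp.h>

#define MAX_THREADS 64

void minmax_chunks(const short *a, int n, short *Min, short *Max)
{
	short partMin[MAX_THREADS], partMax[MAX_THREADS];
	int nThreads = 1;

	#pragma omp parallel
	{
		int t = omp_get_thread_num();
		int chunk = n / omp_get_num_threads();
		const short *p = a + t * chunk;
		short mn = SHRT_MAX, mx = SHRT_MIN;

		#pragma omp master
		nThreads = omp_get_num_threads();

		for (int i = 0; i < chunk; i++) {	// vectorizable inner scan
			if (p[i] < mn) mn = p[i];
			if (p[i] > mx) mx = p[i];
		}
		partMin[t] = mn;	// one store per thread
		partMax[t] = mx;
	}

	// reduce the per-thread results outside the parallel region
	short mn = SHRT_MAX, mx = SHRT_MIN;
	for (int t = 0; t < nThreads; t++) {
		if (partMin[t] < mn) mn = partMin[t];
		if (partMax[t] > mx) mx = partMax[t];
	}
	// to-do: tail elements when n is not divisible by the thread count
	*Min = mn;
	*Max = mx;
}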

Jim

TimP
Honored Contributor III

If the partial results are stored to an array within the parallel region, a false sharing situation is created. It may not matter, if the inner loops are long enough, with register accumulation and no shared memory store in the inner loop. I suppose it looks cleaner than the critical section, except that it's more difficult to optimize for a run-time selectable number of threads.

I was amazed how many references turned up in a web search pointing out that OpenMP defines MIN and MAX reductions only for Fortran, and then dropping the subject there. I don't see satisfactory examples showing the parallel vectorized loop scheme which we agree on in this forum thread. Even with Fortran (using the compilers I have experience with) it means explicitly dividing the job into an inner vectorizable loop and an outer parallelized loop if reasonable efficiency is required.
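
For concreteness, a sketch of that scheme in C, with the merge done in a critical section (the block size and names are arbitrary):

#include <limits.h>

void minmax_blocked(const short *a, int n, short *Min, short *Max)
{
	short gMin = SHRT_MAX, gMax = SHRT_MIN;

	#pragma omp parallel
	{
		short mn = SHRT_MAX, mx = SHRT_MIN;

		#pragma omp for nowait
		for (int b = 0; b < n; b += 4096) {	// outer loop parallelized
			int e = (b + 4096 < n) ? b + 4096 : n;
			for (int i = b; i < e; i++) {	// inner loop vectorizable
				if (a[i] < mn) mn = a[i];
				if (a[i] > mx) mx = a[i];
			}
		}

		#pragma omp critical
		{
			// one short merge per thread, so contention is negligible
			if (mn < gMin) gMin = mn;
			if (mx > gMax) gMax = mx;
		}
	}

	*Min = gMin;
	*Max = gMax;
}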

levicki
Valued Contributor I
JimDempseyAtTheCove:

Is the compiler generating code using PMAXSW and PMINSW (and two PSHUFW's)?

Yes it is, but there are three shuffles (2x PSHUFD and 1x PSHUFLW) for the horizontal reduction. Since I used QxSSE4.1 I was hoping it would use PHMINPOSUW, but it doesn't.
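
For what it's worth, this is how I imagine PHMINPOSUW could do the horizontal step for signed words, by XOR-biasing the lanes into unsigned order (SSE4.1 intrinsics, untested):

#include <smmintrin.h>

static short hmin_epi16(__m128i v)
{
	// s ^ 0x8000 maps signed order onto unsigned order
	__m128i b = _mm_set1_epi16((short)0x8000);
	__m128i m = _mm_minpos_epu16(_mm_xor_si128(v, b));	// PHMINPOSUW
	return (short)(_mm_extract_epi16(m, 0) ^ 0x8000);
}

static short hmax_epi16(__m128i v)
{
	// s ^ 0x7FFF reverses that order, so minpos finds the maximum
	__m128i b = _mm_set1_epi16(0x7FFF);
	__m128i m = _mm_minpos_epu16(_mm_xor_si128(v, b));
	return (short)(_mm_extract_epi16(m, 0) ^ 0x7FFF);
}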

By the way, I don't quite understand why it is that whenever I suggest an instruction for addition to the x86 instruction set, someone from Intel starts talking about orthogonality as an issue, when they themselves obviously don't care about it (read: PHMINPOSSW, PHMAXPOSUW, and PHMAXPOSSW are missing)?!

JimDempseyAtTheCove:

You call the ASM function within the parallel region. No difference in coding.

I am not exactly sure how I would do that. Can you give an example?

JimDempseyAtTheCove:

Do the preabmle and postamble alignment in C++

No need for alignment, just for the loop tail in case n isn't divisible by 4.

tim18:

I don't see satisfactory examples showing the parallel vectorized loop scheme which we agree on in this forum thread.

My thoughts exactly, and the problem is both common and trivial. Conclusion: OpenMP sucks because you can't define your own reduction operators (for example, a function to call to perform the reduction). On the other hand, using TBB just for this is overkill. In my opinion OpenMP needs fixing. So... OpenMP 3.1 anyone?
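
For reference, a built-in min/max reduction for C/C++ would collapse the whole function to the following (OpenMP 3.1 eventually standardized exactly this syntax, so it needs a newer compiler than those discussed here):

#include <limits.h>

void test_reduction(const short *a, int n, short *Min, short *Max)
{
	short sMin = SHRT_MAX;
	short sMax = SHRT_MIN;

	#pragma omp parallel for reduction(min:sMin) reduction(max:sMax)
	for (int i = 0; i < n; i++) {
		if (a[i] < sMin) sMin = a[i];
		if (a[i] > sMax) sMax = a[i];
	}

	*Min = sMin;
	*Max = sMax;
}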

jimdempseyatthecove
Honored Contributor III

Igor,

Something like this


volatile int theMin = INT_MAX;	// Igor, pick what you want for these
volatile int theMax = INT_MIN;
int threadsUsed;
int chunkSize;
#pragma omp parallel
{
	threadsUsed = omp_get_num_threads();	// all threads will overstrike threadsUsed
	threadNum = omp_get_thread_num();
	chunkSize = arraySize / threadsUsed;	// all threads will overstrike chunkSize
	int localMin;
	int localMax;
	findMinMax(&Array[chunkSize * threadNum], chunkSize, &localMin, &localMax);
	while (localMin < theMin)
	{
		int bufferedMin = theMin;
		if (localMin >= bufferedMin) break;	// in case another thread stored a new min below ours
		_InterlockedCompareExchange(&theMin, localMin, bufferedMin);
	}
}
// to-do - test for tail and fixup.

Then write an ASM function for findMinMax
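
If you would rather avoid raw ASM, a rough intrinsics equivalent of findMinMax using PMINSW/PMAXSW might look like this sketch (it assumes 16-byte-aligned data and n divisible by 8, omits the preamble/postamble, and returns shorts where the snippet above passes ints, which is one of the things to fix up):

#include <emmintrin.h>
#include <limits.h>

void findMinMax(const short *a, int n, short *Min, short *Max)
{
	__m128i vmn = _mm_set1_epi16(SHRT_MAX);
	__m128i vmx = _mm_set1_epi16(SHRT_MIN);

	for (int i = 0; i < n; i += 8) {
		__m128i v = _mm_load_si128((const __m128i *)(a + i));
		vmn = _mm_min_epi16(vmn, v);	// PMINSW
		vmx = _mm_max_epi16(vmx, v);	// PMAXSW
	}

	// scalar horizontal reduction of the 8 lanes
	short mn[8], mx[8];
	_mm_storeu_si128((__m128i *)mn, vmn);
	_mm_storeu_si128((__m128i *)mx, vmx);
	short sMin = mn[0], sMax = mx[0];
	for (int k = 1; k < 8; k++) {
		if (mn[k] < sMin) sMin = mn[k];
		if (mx[k] > sMax) sMax = mx[k];
	}
	*Min = sMin;
	*Max = sMax;
}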

Jim Dempsey

levicki
Valued Contributor I

Thanks, Jim. I will investigate that path as well, but at first glance it seems that you have forgotten about the maximum, you should have used _InterlockedCompareExchange16() (with volatile short theMin, theMax), and threadNum is not defined.

jimdempseyatthecove
Honored Contributor III

Igor,

As I said, it is "something like" the code sample; fixing up the code is an exercise for you. My assistance was to show you how to structure a parallel region for calling an ASM function. You also might want to test the return value from _Interlocked..., but then the while condition would be tested twice. And you may want to parcel the work out in page-size chunks, etc.

Good luck on writing your code.

Jim

levicki
Valued Contributor I

Jim,

Just to be clear, I wasn't criticising you; if anyone around here knows about threading it is probably you, which is why I was expecting a working example. I don't have much experience with it yet, so I would most likely pick an inefficient way to include the maximum testing. I am thankful for your help.

jimdempseyatthecove
Honored Contributor III

Igor,

I guess I am having a bad day (or month). Trying to resolve deployment issues involving space elevators. The simulations have been blowing up. And I have less than a week to pull together a paper. If I did the programming for you, then you wouldn't have gained the experience for your next problem.

For the ASM subroutine, I suggest you make a pass through the chunk of data using PMINSW and PMAXSW to pull together the 4 or 8 mins and maxes (64-bit or 128-bit, depending on your preference), then extract each one out of the four/eight results. I.e., treat the chunk of N integers as a [4, N/4] or [8, N/8] array of integers and scan it to get the min and max for each of the 4 or 8 columns. I would have the ASM routine return the 4 or 8 mins and maxes instead of a single min and max. Then, outside the parallel region, call the ASM routine once more to produce the 4 or 8 mins and maxes from the results of each of the threads (or each of the chunks), and finally reduce those to a single min and max.

This way, if the list is large, and if each core is not available 100% of the time (e.g. the system is doing other work), you can use smaller chunk sizes and distribute the chunks in the following manner.

a) Each chunk has an available/taken flag that indicates whether processing has begun (we don't care about a finished state). _InterlockedCompareExchange is used to test and transition the flag from available to taken; if it was formerly available, the thread processes the chunk (a tiny sketch of this take step follows the flag-state list below).
b) Each chunk in the list has an OMP thread number (team member number) preference (e.g. mod(chunkNumber, nThreads) == threadNumber).
c) Each thread makes two scans of the chunk list: 1) scan through the chunks of its preference to find and take available ones, and 2) scan through the chunks not of its preference that are still available (then attempt to take and process them).
d) After the 2nd pass, return the 4 or 8 mins and maxes.

You will have to flesh this out.

The above is complicated and won't have any payback unless the system is not 100% available (busy doing other work or you have nested parallel regions).

You could make the algorithm a bit more complex and add a 3rd, 4th, 5th, ... state to the available/taken flag: taken 2 times, taken 3 times, and so on.

When a thread has made the two passes described above, it then makes a 3rd pass looking for chunks that are taken but not done. This would occur if a thread were stalled due to a context switch. The 3rd pass should use a sequence counter (likely the flag word): first find a chunk which is not available, not done, and has only been taken 1 time; then on a 4th pass accept those taken 2 times, etc. Since the same chunk may have been started by multiple threads, you should also occasionally look at the chunk's flag word to see if a different thread has finished the chunk, so you can bail out.

Suggested flag states: 0 = available, -1 = done, 1 = taken once, 2 = taken twice, ..., n = taken n times.
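
The take step under this flag scheme can be as small as the following sketch (assuming the Windows interlocked intrinsics; the chunk bookkeeping around it is up to you):

#include <intrin.h>

// Try to move a chunk's flag from `seen` takes to `seen + 1` takes.
// Returns nonzero if this thread won the chunk at that level.
// Pass seen = 0 on the first pass, 1 on the first rescue pass, etc.
int try_take_chunk(volatile long *flag, long seen)
{
	return _InterlockedCompareExchange(flag, seen + 1, seen) == seen;
}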

Jim Dempsey

levicki
Valued Contributor I

Jim,

Thanks for the ideas, I'll look into them. Hopefully your simulations start working.
