Hey tim18,
you're absolutely right: C lacks intrinsic min/max operators. Fortran has them, which is why they were enabled in OpenMP. If you have a real-world case that would benefit from better OpenMP support for reductions (e.g. min/max, custom types), I'd be happy to gain access to it. I'm looking for good examples that justify new kinds of reductions in OpenMP, and also for the workarounds people currently need to use.
What do you mean by esoteric C++ stuff being coupled into OpenMP? The OpenMP ARB does not focus on any one of the three base languages; it keeps new features compatible with all three.
Cheers,
-michael
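A minimal sketch of the workaround people commonly use today when reduction(max:) is unavailable for C/C++: each thread keeps a private running maximum, and the partial results are merged in a single critical section. The function and variable names here are illustrative, not from any particular poster's code.

```cpp
#include <cfloat>

// Workaround for the missing C/C++ reduction(max:) clause: per-thread
// private maxima, merged once per thread in a critical section.
double max_reduce_workaround(const double* data, int n) {
    double global_max = -DBL_MAX;
    #pragma omp parallel
    {
        double local_max = -DBL_MAX;       // thread-private partial result
        #pragma omp for nowait
        for (int i = 0; i < n; ++i)
            if (data[i] > local_max) local_max = data[i];
        #pragma omp critical               // one short merge per thread
        if (local_max > global_max) global_max = local_max;
    }
    return global_max;
}
```

The critical section runs once per thread, not once per element, so its cost is negligible; the code also compiles and runs correctly serially when OpenMP is disabled, since the pragmas are then ignored.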
My own reduction code, which gives each thread groups of data segments and private variables to accumulate partial results, then combines the results in a critical section, scales to 8 threads on NHM with no problem, but doesn't scale further (to 12 threads on Westmere), where it seems a little more attention to the platform would be needed. I'll have to work on it more to see whether I can demonstrate an advantage for the Fortran built-in OpenMP reduction.
Tim,
Have you considered creating a min/max reduction array, one element per thread/task, partitioning the work to the threads/tasks and passing in the address of the results array? Then on completion, the main thread performs the final min/max without using a critical section.
IOW, each thread computes a local min/max, then writes its local result once to the shared min/max array (no lock, no CAS). The main thread can then either wait for all threads to complete (parallel construct synchronization object) OR simply walk through the results array waiting for the pre-loaded values to change from HUGE_VAL (or -HUGE_VAL) (no task synchronization object).
I suspect that when you maxed out at 12 threads you hit a memory bandwidth problem. There is not much you can do after that (assuming you are already using SSE) other than reordering your code so that your min/max function proceeds while the data is in cache.
Overlapping functionality to take advantage of cache locality is often tricky to do, since adding this additional functionality into a well-written loop makes the loop less general and also tends to insert branches where there were none (slowing down the code).
What can aid in this overlapping process, while keeping your well-written tight loops separate from the additional functionality (the min/max function), is recoding to use a pipeline architecture. Through use of a pipeline you can maintain cache locality as data passes through the functional pipes. Often no change (or very little change) is required to your original code; when change is required, it usually relates to function entry arguments rather than to the functional algorithm itself.
Jim Dempsey
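The scheme Jim describes might be sketched as follows (all names are illustrative, and the padding size is an assumption about a 64-byte cache line): each thread writes its local minimum exactly once into its own padded slot of a shared results array pre-loaded with HUGE_VAL, and the main thread reduces the array after the parallel region, so no critical section or CAS is needed.

```cpp
#include <cmath>
#include <algorithm>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#else
static int omp_get_thread_num() { return 0; }  // serial fallback
#endif

// One slot per thread, padded so neighboring slots never share a
// cache line (avoids false sharing on the single write).
struct PaddedSlot { double v; char pad[56]; };

double min_via_results_array(const double* data, int n, int nthreads) {
    std::vector<PaddedSlot> results(nthreads, PaddedSlot{HUGE_VAL, {}});
    #pragma omp parallel num_threads(nthreads)
    {
        double local_min = HUGE_VAL;                 // thread-private partial result
        #pragma omp for nowait
        for (int i = 0; i < n; ++i)
            local_min = std::min(local_min, data[i]);
        results[omp_get_thread_num()].v = local_min; // single write, no lock
    }
    double global_min = HUGE_VAL;                    // final pass on the main thread
    for (int t = 0; t < nthreads; ++t)
        global_min = std::min(global_min, results[t].v);
    return global_min;
}
```

Because every slot starts at HUGE_VAL, unused slots are harmless in the final pass, which also makes the sentinel-polling variant Jim mentions possible.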
Since C++-specific threading features are becoming widespread, I agreed to put forward our colleague's proposal to add a reduction analogous to the Fortran one.
Great, now if only they agree to implement it :)
Another thing I would really like to see in ICC is natural alignment for __m64, __m128, and other SIMD vector data types without the need to use __declspec(align(n)) all over the code. That would also allow us to use new and delete to allocate aligned arrays. Adding overloads of new and delete with an additional parameter that says "align to n bytes" would also be very welcome.
In my opinion those types are already part of the language, which should follow the hardware as it evolves instead of becoming a constraint and requiring additional work from developers to get basic things done.
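The kind of "align to n bytes" overload being requested could look something like the sketch below. This is a hypothetical user-side API, not something the compiler provides; posix_memalign is POSIX-specific (Windows would need _aligned_malloc instead). (C++17 later standardized exactly this idea via std::align_val_t.)

```cpp
#include <cstdlib>
#include <new>
#include <cstddef>

// Tag type carrying the requested alignment (power of two,
// multiple of sizeof(void*), per posix_memalign's contract).
struct aligned_t { std::size_t bytes; };

void* operator new[](std::size_t size, aligned_t a) {
    void* p = nullptr;
    if (posix_memalign(&p, a.bytes, size) != 0) throw std::bad_alloc();
    return p;
}

// Matching placement delete; the runtime calls it automatically only
// if a constructor throws, so normal cleanup must invoke it explicitly.
void operator delete[](void* p, aligned_t) noexcept { std::free(p); }
```

Usage would be `float* v = new (aligned_t{32}) float[1024];` followed by an explicit `operator delete[](v, aligned_t{32});` when done, since plain `delete[]` would not pair with the aligned allocation.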
The 64-bit OS gives automatic 16-byte alignment in most of the situations where it is needed. I think it's an open question whether better support for 32-byte alignment will be needed to support AVX. I've heard discussion of an argument to the icc-specific pragmas, such as
#pragma vector aligned(32)
so I suppose your suggestion of such an overload for new and delete may help with that problem, as well as with the problems that already exist on 32-bit OS.
I would expect such a feature request might better be submitted by a more expert C++ developer than myself. If there isn't already such an overload in TBB, maybe it could be proposed and adopted more quickly there.
C omp reduction max and min operators were included in the OpenMP 3.1 standard, but I'm still looking for a widely available compiler that implements them. The gcc testsuite includes Fortran but not C or C++ tests for max or min reduction. OpenMP 4.0 defines both parallel and simd capabilities for min and max reduction; apparently, Intel compilers will advertise OpenMP 4 support before these have been implemented. Other OpenMP 4 reductions are supported now in current icc.
icpc does an excellent job of vectorizing std::max() even without the omp simd reduction directive. Sometimes icc can vectorize equivalent C code, possibly with the help of #pragma vector; more often, gcc can accomplish max/min vectorization. Cilk(tm) Plus includes equivalent reducers, but they seem to be less reliably optimized than gcc is with standard C source code.
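The kind of loop in question is an ordinary max reduction written with std::max, which compilers can turn into SIMD max instructions (maxps/maxpd) without any OpenMP directive; this sketch just illustrates the idiom being discussed.

```cpp
#include <algorithm>

// A plain max reduction written with std::max; the straight-line
// conditional-free body is the form auto-vectorizers recognize.
float max_std(const float* a, int n) {
    float m = a[0];
    for (int i = 1; i < n; ++i)
        m = std::max(m, a[i]);
    return m;
}
```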
The inclusion of min/max reductions in C would be of great value to me. I have a large nearest-neighbor search problem, for surface registration, that calls for it. The lack of a min/max reduction forces alternative implementations that appear suboptimal and unclear.
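A nearest-neighbor query like this one is a good example of why the workarounds feel unclear: we usually need the index of the closest point, not just the minimum distance, so even reduction(min:) alone would not suffice. A minimal sketch (names are illustrative) of the per-thread best-pair pattern it forces:

```cpp
#include <cfloat>

// Argmin over an array of distances: each thread tracks its own
// (distance, index) pair; the pairs are merged once per thread in a
// critical section, since no built-in reduction carries the index.
int nearest_point(const double* dist, int n) {
    double best_d = DBL_MAX;
    int best_i = -1;
    #pragma omp parallel
    {
        double d = DBL_MAX; int idx = -1;   // thread-private best pair
        #pragma omp for nowait
        for (int i = 0; i < n; ++i)
            if (dist[i] < d) { d = dist[i]; idx = i; }
        #pragma omp critical
        if (d < best_d) { best_d = d; best_i = idx; }
    }
    return best_i;
}
```

OpenMP 4.0's declare reduction would let this (distance, index) pair be expressed as a user-defined reduction instead, which is the kind of custom-type reduction use case asked about earlier in the thread.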
I beg your pardon; English is not my native language and sometimes I cannot express myself the right way. I was trying to support the idea of adopting the OpenMP 3.1 min/max reduction for C and C++, not asking for a C or C++ standard modification!
I only want to use something like :
#pragma omp parallel for reduction(max : max_value)
which should be very handy for lots of programming problems.
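In full context, the requested OpenMP 3.1 form might be used like this minimal sketch; with a compiler that implements the clause, the per-thread maxima are combined automatically.

```cpp
#include <cfloat>

// The OpenMP 3.1 max reduction form requested in this thread; the
// runtime gives each thread a private max_value initialized to the
// identity and merges them at the end of the loop.
double max_with_reduction(const double* a, int n) {
    double max_value = -DBL_MAX;
    #pragma omp parallel for reduction(max : max_value)
    for (int i = 0; i < n; ++i)
        if (a[i] > max_value) max_value = a[i];
    return max_value;
}
```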
Sergey Kostrov wrote:
>>I only want to use something like :
>>
>>#pragma omp parallel for reduction(max : max_value)
I think you need to submit your proposal to a right place, for example http://www.openmp.org ( mailto:webmaster@openmp.org ).
This form is already defined by OpenMP 3.1, although I've seen it denied. OpenMP 4.0 RC2 also defines
#pragma omp parallel for simd reduction(max : max_value)
to specify explicitly that both simd and thread parallel optimizations are desired, as well as forms for simd without threaded parallelism.
I guess Intel compilers are waiting to implement reduction(max/min: ) until there is documented demand for the OpenMP 4 forms.
I've drafted a white paper on the omp simd forms already supported, as well as alternatives for those which aren't (e.g. C++ std::max(), std::max_element(), ...).
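The simd-only OpenMP 4.0 form mentioned above might be sketched as follows: it asks the compiler to vectorize the max reduction within a single thread, with no parallel region at all.

```cpp
#include <cfloat>

// OpenMP 4.0 simd-only max reduction: SIMD lanes each keep a partial
// maximum that the compiler combines after the loop; no threads spawned.
float simd_max(const float* a, int n) {
    float m = -FLT_MAX;
    #pragma omp simd reduction(max : m)
    for (int i = 0; i < n; ++i)
        if (a[i] > m) m = a[i];
    return m;
}
```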
The gcc development branch, accessed by e.g.
svn co svn://gcc.gnu.org/svn/gcc/branches/gomp-4_0-branch gcc-omp4
includes early support for OpenMP 3.1 and 4.0 max|min reductions. If it reaches sufficient maturity to be merged into gcc-4.9, gcc would support these features well before they appear in Intel C/C++. I had to use the configure --disable-werror option in order to build this gcc.
As indicated elsewhere, the libiomp5 and open-source Intel OpenMP libraries should already include support for omp parallel reduction(min|max: ). The feature is not expected to appear in the icc release this week, although several other OpenMP 4.0 features will. It should be possible to use linux clang as well as the gcc branch to test min|max reductions with the Intel and gomp libraries.
I didn't find any max|min reduction tests in the gcc testsuite. There are some gfortran max|min reduction unit tests, but the gomp-4_0-branch doesn't appear to add any gfortran OpenMP 4 features.