Turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- Intel Community
- Software
- Software Development Tools (Compilers, Debuggers, Profilers & Analyzers)
- Intel® C++ Compiler
- Histogram: Manual reduction with OpenMP 4

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

velvia

Beginner

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

09-20-2016
06:38 AM

462 Views

Histogram: Manual reduction with OpenMP 4

Hi,

I would like to parallelize the computation of an histogram.

- Let's suppose that you have a std::vector<double> of elements in between 0 and 1 (think of a size of 1 billion)

- Let's suppose that you have an integer n known at runtime that subdivide [0, 1] in [0, 1/n[, [1/n, 2/n[, ..., [(n-1)/n, 1].

For every k in {0,1,...,n-1}, I would like to compute the number of elements of the vector in [k, (k+1)/n[. I have managed to program that in OpenMP, with an histogram for every thread and a summation of all the histogram at the end. However, as I want to use a Xeon Phi with many threads, I would like the summation at the end to be in log(nb_of_threads) instead of being linear. Is there a way to do that using OpenMP 4?

Best regards,

Francois

Link Copied

11 Replies

TimP

Black Belt

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

09-20-2016
08:24 AM

462 Views

OpenMP 4 adds C max|min reductions to the previous options for parallel reduction(), as well as adding omp simd reductions (1 thread, using the simd parallelism). These are usually implemented as a tree reduction so might approximate your desire. Unfortunately (on KNC) I haven't seen my own parallel reduction usage showing a gain over omp parallel with a critical section so it is important not to use more threads than optimum (no more than 2 threads per cache tile... in my possibly quite different tests). If the basic omp reduction operations (simd or parallel) don't apply for your algorithm you may have to write out the tree reduction.

The simple critical section choice may speed up a reduction but would retain the likelihood of showing a linear time behavior.

Openmp 4 includes c array reduction. I've seen it work effectively for 1 or 2 threads, but otherwise the atomic is likely to be better.

jimdempseyatthecove

Black Belt

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

09-20-2016
09:07 AM

462 Views

Can you post your code?

A lot of performance difference will depend on how well vectorization is used.

Which Xeon Phi are you using?

Is this an offload or native application?

Jim Dempsey

SergeyKostrov

Valued Contributor II

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

09-25-2016
05:55 PM

462 Views

SergeyKostrov

Valued Contributor II

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

09-25-2016
09:50 PM

462 Views

McCalpinJohn

Black Belt

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

09-26-2016
07:54 AM

462 Views

Isn't this what the OpenMP ATOMIC construct is for?

The fundamental operation of a histogram is an atomic increment, which can be implemented in hardware fairly efficiently.

The OpenMP 4.0.0 standard section 2.12.6 describes the "atomic Construct". From the examples, it looks like you want all of the threads to update the histogram array using something like:

// Distribute array locations across threads #pragma omp parallel for for (i=0; i<ARRAY_SIZE; i++) { // Every threads processes all bins for (k=0; k<n; k++) { if (arrayis in the kth range) { #pragma omp atomic update histogram++; } }

This approach ought to be able to exploit the hardware's ability to do atomic updates, which should be much more efficient than a critical section.

velvia

Beginner

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

09-26-2016
09:15 AM

462 Views

Hi,

The code is meant to be used on Xeon an Xeon Phi KNL. It is mainly for an "academic" purpose as it is for a code modernization workshop. I understand that this code is memory bound but, it could be useful to launch many threads on a NUMA system, and a Xeon Phi KNL in quadrant mode should be considered as a quad-socket system.

Here is the code of the algorithm.

#include <chrono> #include <cstdio> #include <random> #include <omp.h> int main() { int nb_point{1000000000}; int nb_cluster{1000}; int nb_thread{omp_get_max_threads()}; float* x{new float[nb_point]}; int* index{new int[nb_point]}; float* sum{new float[nb_cluster]}; float* local_sum{new float[nb_cluster * nb_thread]}; std::default_random_engine generator{}; std::uniform_real_distribution<float> real_distribution{0.0f, 1.0f}; #pragma omp parallel for for (int i{0}; i < nb_point; ++i) { x= real_distribution(generator); index= i % nb_cluster; } for (int j{0}; j < nb_cluster; ++j) { sum= 0.0; } for (int j{0}; j < nb_cluster * nb_thread; ++j) { local_sum = 0.0; } auto begin = std::chrono::high_resolution_clock::now(); // Reduction #pragma omp parallel for for (int i{0}; i < nb_point; ++i) { int id{omp_get_thread_num()}; int j{index }; local_sum[id * nb_cluster + j] += x; } // Finalizing the reduction for (int id{0}; id < nb_thread; ++id) { for (int j{0}; j < nb_cluster; ++j) { sum+= local_sum[id * nb_cluster + j]; } } auto end = std::chrono::high_resolution_clock::now(); float time{1.0e-9 * std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin) .count()}; std::printf("Time total: %7.3f\n", time); delete[] local_sum; delete[] sum; delete[] index; delete[] x; return 0; }

Gregg_S_Intel

Employee

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

09-26-2016
09:54 AM

462 Views

Minor clarification, a KNL in sub-NUMA clustering mode is configured as either 2 or 4 NUMA nodes. Quadrant mode is a uniform memory space.

Confess I don't see an advantage of this code for NUMA.

jimdempseyatthecove

Black Belt

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

09-26-2016
10:34 AM

462 Views

I see you code that builds the random distribution in x[], But I do not see the code inside the timed region that builds the histogram.

Why isn't this part of the timed region?

Note, line 38 (post #7) is not producing a histogram. The equivalent of

for(int i{0}; i < nb_point; ++i) { sum[(int)(x* ((double)nb_cluster))]++; } // Note, sum needs to be allocated to nb_cluster+1 // random range was defined as inclusive of 0.0 and 1.0

Jim Dempsey

Francois_F_

Beginner

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

09-26-2016
12:34 PM

462 Views

Hi Jim,

I agree that this is not an histogram. Instead of adding one to the "bin", we add the value. It looked to me close enough to the histogram algorithm that I have simplified things. Here, the part where we finalize the reduction is linear in the number of threads. I want to make it in log of the number of threads.

jimdempseyatthecove

Black Belt

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

09-26-2016
04:28 PM

462 Views

On my KNL system, Time total: 0.073 seconds.

Hardly enough time to worry about for optimization.

You should have used:

index* = (int)(x *nb_cluster); // you may want to adjust rounding here*

I will leave it to you as an implementation detail as to if the number of indexes are nb_cluster or nb_cluster+1

Jim Dempsey

SergeyKostrov

Valued Contributor II

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

09-26-2016
06:18 PM

462 Views

Topic Options

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

For more complete information about compiler optimizations, see our Optimization Notice.