- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
I would like to parallelize the computation of an histogram.
- Let's suppose that you have a std::vector<double> of elements in between 0 and 1 (think of a size of 1 billion)
- Let's suppose that you have an integer n known at runtime that subdivide [0, 1] in [0, 1/n[, [1/n, 2/n[, ..., [(n-1)/n, 1].
For every k in {0,1,...,n-1}, I would like to compute the number of elements of the vector in [k, (k+1)/n[. I have managed to program that in OpenMP, with an histogram for every thread and a summation of all the histogram at the end. However, as I want to use a Xeon Phi with many threads, I would like the summation at the end to be in log(nb_of_threads) instead of being linear. Is there a way to do that using OpenMP 4?
Best regards,
Francois
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
OpenMP 4 adds C max|min reductions to the previous options for parallel reduction(), as well as adding omp simd reductions (1 thread, using the simd parallelism). These are usually implemented as a tree reduction so might approximate your desire. Unfortunately (on KNC) I haven't seen my own parallel reduction usage showing a gain over omp parallel with a critical section so it is important not to use more threads than optimum (no more than 2 threads per cache tile... in my possibly quite different tests). If the basic omp reduction operations (simd or parallel) don't apply for your algorithm you may have to write out the tree reduction.
The simple critical section choice may speed up a reduction but would retain the likelihood of showing a linear time behavior.
Openmp 4 includes c array reduction. I've seen it work effectively for 1 or 2 threads, but otherwise the atomic is likely to be better.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Can you post your code?
A lot of performance difference will depend on how well vectorization is used.
Which Xeon Phi are you using?
Is this an offload or native application?
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Isn't this what the OpenMP ATOMIC construct is for?
The fundamental operation of a histogram is an atomic increment, which can be implemented in hardware fairly efficiently.
The OpenMP 4.0.0 standard section 2.12.6 describes the "atomic Construct". From the examples, it looks like you want all of the threads to update the histogram array using something like:
// Distribute array locations across threads #pragma omp parallel for for (i=0; i<ARRAY_SIZE; i++) { // Every threads processes all bins for (k=0; k<n; k++) { if (array is in the kth range) { #pragma omp atomic update histogram++; } }
This approach ought to be able to exploit the hardware's ability to do atomic updates, which should be much more efficient than a critical section.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
The code is meant to be used on Xeon an Xeon Phi KNL. It is mainly for an "academic" purpose as it is for a code modernization workshop. I understand that this code is memory bound but, it could be useful to launch many threads on a NUMA system, and a Xeon Phi KNL in quadrant mode should be considered as a quad-socket system.
Here is the code of the algorithm.
#include <chrono> #include <cstdio> #include <random> #include <omp.h> int main() { int nb_point{1000000000}; int nb_cluster{1000}; int nb_thread{omp_get_max_threads()}; float* x{new float[nb_point]}; int* index{new int[nb_point]}; float* sum{new float[nb_cluster]}; float* local_sum{new float[nb_cluster * nb_thread]}; std::default_random_engine generator{}; std::uniform_real_distribution<float> real_distribution{0.0f, 1.0f}; #pragma omp parallel for for (int i{0}; i < nb_point; ++i) { x = real_distribution(generator); index = i % nb_cluster; } for (int j{0}; j < nb_cluster; ++j) { sum= 0.0; } for (int j{0}; j < nb_cluster * nb_thread; ++j) { local_sum = 0.0; } auto begin = std::chrono::high_resolution_clock::now(); // Reduction #pragma omp parallel for for (int i{0}; i < nb_point; ++i) { int id{omp_get_thread_num()}; int j{index}; local_sum[id * nb_cluster + j] += x; } // Finalizing the reduction for (int id{0}; id < nb_thread; ++id) { for (int j{0}; j < nb_cluster; ++j) { sum += local_sum[id * nb_cluster + j]; } } auto end = std::chrono::high_resolution_clock::now(); float time{1.0e-9 * std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin) .count()}; std::printf("Time total: %7.3f\n", time); delete[] local_sum; delete[] sum; delete[] index; delete[] x; return 0; }
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Minor clarification, a KNL in sub-NUMA clustering mode is configured as either 2 or 4 NUMA nodes. Quadrant mode is a uniform memory space.
Confess I don't see an advantage of this code for NUMA.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I see you code that builds the random distribution in x[], But I do not see the code inside the timed region that builds the histogram.
Why isn't this part of the timed region?
Note, line 38 (post #7) is not producing a histogram. The equivalent of
for(int i{0}; i < nb_point; ++i) { sum[(int)(x * ((double)nb_cluster))]++; } // Note, sum needs to be allocated to nb_cluster+1 // random range was defined as inclusive of 0.0 and 1.0
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi Jim,
I agree that this is not an histogram. Instead of adding one to the "bin", we add the value. It looked to me close enough to the histogram algorithm that I have simplified things. Here, the part where we finalize the reduction is linear in the number of threads. I want to make it in log of the number of threads.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
On my KNL system, Time total: 0.073 seconds.
Hardly enough time to worry about for optimization.
You should have used:
index = (int)(x*nb_cluster); // you may want to adjust rounding here
I will leave it to you as an implementation detail as to if the number of indexes are nb_cluster or nb_cluster+1
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page