Hi,

I would like to parallelize the computation of a histogram.

- Let's suppose that you have a std::vector<double> of elements between 0 and 1 (think of a size of 1 billion).

- Let's suppose that you have an integer n, known at runtime, that subdivides [0, 1] into [0, 1/n[, [1/n, 2/n[, ..., [(n-1)/n, 1].

For every k in {0, 1, ..., n-1}, I would like to compute the number of elements of the vector in [k/n, (k+1)/n[. I have managed to program this in OpenMP, with a histogram for every thread and a summation of all the histograms at the end. However, as I want to use a Xeon Phi with many threads, I would like the summation at the end to run in log(nb_of_threads) time instead of linear time. Is there a way to do that using OpenMP 4?

Best regards,

Francois

OpenMP 4 adds C max/min reductions to the previous options for parallel reduction(), as well as omp simd reductions (one thread, using SIMD parallelism). These are usually implemented as a tree reduction, so they might approximate what you want. Unfortunately (on KNC) I haven't seen my own parallel reduction usage show a gain over omp parallel with a critical section, so it is important not to use more threads than optimum (no more than 2 threads per cache tile, in my possibly quite different tests). If the basic omp reduction operations (simd or parallel) don't apply to your algorithm, you may have to write out the tree reduction yourself.

The simple critical-section choice may speed up a reduction, but it is still likely to show linear-time behavior.

OpenMP 4.5 adds C array-section reductions. I've seen them work effectively for 1 or 2 threads, but otherwise the atomic is likely to be better.

Can you post your code?

A lot of performance difference will depend on how well vectorization is used.

Which Xeon Phi are you using?

Is this an offload or native application?

Jim Dempsey

**Histogram** algorithm. You know that:

**1**. Its asymptotic complexity is **O(N)**, and the algorithm builds the histogram in one processing pass (!) over the data.

**2**. This is a **memory bound** algorithm, and no floating point operations are involved.

My data sets were even bigger than **1 billion elements**, that is, **4 Giga Elements** (GE) and **8 GEs**. As soon as OpenMP threads (let's say 4 hardware threads) started "fighting" for a histogram bin, and you know that a critical section needs to be used in order to update a value in a bin correctly (!), processing performance went down significantly.

I came to the conclusion that the number of **memory channels** of a CPU needs to be taken into account. That is, no more than two hardware threads should be used on a CPU with two **memory channels**, no more than four hardware threads on a CPU with four **memory channels**, and so on. Once again, this is because it is a **memory bound** algorithm.

However, because in my case a **single threaded** version was so efficient, it was decided that in production code for a **Pigeonhole Sorting** algorithm, for data sets of integers, no more than one thread would be used. Take into account that if a data set fits into RAM completely (not RAM plus virtual memory!), then a **single threaded** version of the algorithm could satisfy all your performance requirements.

Here are 2-year-old performance results for the **Pigeonhole Sorting** algorithm:

**1**. A data set of **4 GEs** was sorted in **~8.9** seconds.

**2**. A data set of **8 GEs** was sorted in **~219.4** seconds (note: there was some negative performance impact from using virtual memory during processing).

In the **8 GEs** sorting test, processing was **18x** faster than a recursive **Merge Sorting** algorithm on an identical (!) data set. Of course, you can still try to parallelize the **Histogram** algorithm for R&D or academic purposes.

**std::vector of elements in between 0 and 1**...

- What kind of distribution (of values between 0 and 1) are you going to have?
- In order to build a histogram you need to know the total number of bins in advance. That makes it a much easier task with integers, but you are going to have floating point values (based on your initial requirements), like [0.00000, 0.12345, 0.23456, 0.34567, 0.45678, ...] and so on.
- Declaring the **std::vector** with a single-precision data type (float) doesn't make any sense to me, because it is not clear how many histogram bins need to be allocated before processing (if you're going to add bins dynamically, by growing the array of bins, I can tell you this is not going to work, in terms of performance, once the number of elements exceeds 1 GE).
- Theoretically, it is possible that all elements in a source data set are different; in that case the size of the histogram will equal the size of the source data set, and every bin will hold the value 1 (looks simple, doesn't it?).
- Take into account that histogram algorithms are very sensitive to processing overheads, which is why my implementations of these algorithms contain no STL code (only raw C arrays and raw C indexing).

Isn't this what the OpenMP ATOMIC construct is for?

The fundamental operation of a histogram is an atomic increment, which can be implemented in hardware fairly efficiently.

The OpenMP 4.0 standard, section 2.12.6, describes the "atomic Construct". From the examples, it looks like you want all of the threads to update the histogram array using something like:

```c
// Distribute array locations across threads
#pragma omp parallel for
for (i = 0; i < ARRAY_SIZE; i++) {
    // Every thread tests its element against all bins
    for (k = 0; k < n; k++) {
        if (array[i] is in the kth range) {   // pseudocode condition
            #pragma omp atomic update
            histogram[k]++;
        }
    }
}
```

This approach ought to be able to exploit the hardware's ability to do atomic updates, which should be much more efficient than a critical section.

Hi,

The code is meant to be used on Xeon and Xeon Phi KNL. It is mainly for an "academic" purpose, as it is for a code modernization workshop. I understand that this code is memory bound, but it could be useful to launch many threads on a NUMA system, and a Xeon Phi KNL in quadrant mode should be considered as a quad-socket system.

Here is the code of the algorithm.

```cpp
#include <chrono>
#include <cstdio>
#include <random>

#include <omp.h>

int main() {
  int nb_point{1000000000};
  int nb_cluster{1000};
  int nb_thread{omp_get_max_threads()};
  float* x{new float[nb_point]};
  int* index{new int[nb_point]};
  float* sum{new float[nb_cluster]};
  float* local_sum{new float[nb_cluster * nb_thread]};

#pragma omp parallel
  {
    // One engine per thread: sharing a single engine across threads
    // would be a data race
    std::default_random_engine generator{
        static_cast<unsigned>(omp_get_thread_num())};
    std::uniform_real_distribution<float> real_distribution{0.0f, 1.0f};
#pragma omp for
    for (int i = 0; i < nb_point; ++i) {
      x[i] = real_distribution(generator);
      index[i] = i % nb_cluster;
    }
  }

  for (int j{0}; j < nb_cluster; ++j) sum[j] = 0.0f;
  for (int j{0}; j < nb_cluster * nb_thread; ++j) local_sum[j] = 0.0f;

  auto begin = std::chrono::high_resolution_clock::now();
  // Reduction: each thread accumulates into its own slice of local_sum
#pragma omp parallel for
  for (int i = 0; i < nb_point; ++i) {
    int id{omp_get_thread_num()};
    int j{index[i]};
    local_sum[id * nb_cluster + j] += x[i];
  }
  // Finalizing the reduction: linear in the number of threads
  for (int id{0}; id < nb_thread; ++id)
    for (int j{0}; j < nb_cluster; ++j)
      sum[j] += local_sum[id * nb_cluster + j];
  auto end = std::chrono::high_resolution_clock::now();

  float time{1.0e-9f *
             std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin)
                 .count()};
  std::printf("Time total: %7.3f\n", time);

  delete[] local_sum;
  delete[] sum;
  delete[] index;
  delete[] x;
  return 0;
}
```

Minor clarification, a KNL in sub-NUMA clustering mode is configured as either 2 or 4 NUMA nodes. Quadrant mode is a uniform memory space.

Confess I don't see an advantage of this code for NUMA.

I see your code that builds the random distribution in x[], but I do not see code inside the timed region that builds a histogram.

Why isn't this part of the timed region?

Note, line 38 (post #7) is not producing a histogram. The equivalent would be:

```cpp
for (int i{0}; i < nb_point; ++i) {
  sum[(int)(x[i] * ((double)nb_cluster))]++;
}
// Note, sum needs to be allocated with nb_cluster+1 elements:
// the random range was defined as inclusive of 0.0 and 1.0
```

Jim Dempsey

Hi Jim,

I agree that this is not a histogram. Instead of adding one to the bin, we add the value. It looked close enough to the histogram algorithm that I simplified things. Here, the part that finalizes the reduction is linear in the number of threads; I want to make it logarithmic in the number of threads.

On my KNL system, Time total: 0.073 seconds.

Hardly enough time to worry about for optimization.

You should have used:

```cpp
index[i] = (int)(x[i] * nb_cluster);  // you may want to adjust rounding here
```

I will leave it to you as an implementation detail whether the number of indexes is nb_cluster or nb_cluster + 1.

Jim Dempsey

> if (array[i] is in the kth range) {
> #pragma omp atomic update
> histogram[k]++;
> }

John, if I hadn't investigated **#pragma omp atomic** I would keep silent. However, I've tried different versions, including **#pragma omp critical**, and none of them satisfied my performance objectives. I will need to take a look at the **#pragma omp atomic update** clause.
