Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Histogram: Manual reduction with OpenMP 4

velvia
Beginner

Hi,

I would like to parallelize the computation of a histogram.

- Let's suppose that you have a std::vector<double> of elements between 0 and 1 (think of a size of 1 billion).

- Let's suppose that you have an integer n, known at runtime, that subdivides [0, 1] into [0, 1/n[, [1/n, 2/n[, ..., [(n-1)/n, 1].

For every k in {0, 1, ..., n-1}, I would like to compute the number of elements of the vector in [k/n, (k+1)/n[. I have managed to program that in OpenMP, with a histogram for every thread and a summation of all the histograms at the end. However, as I want to use a Xeon Phi with many threads, I would like the summation at the end to run in log(nb_of_threads) time instead of linear time. Is there a way to do that using OpenMP 4?

Best regards,

Francois

TimP
Honored Contributor III

OpenMP 4 adds C max/min reductions to the previous options for parallel reduction(), as well as omp simd reductions (one thread, using SIMD parallelism). These are usually implemented as a tree reduction, so they might approximate what you want. Unfortunately (on KNC) I haven't seen my own parallel reduction usage show a gain over omp parallel with a critical section, so it is important not to use more threads than optimal (no more than 2 threads per cache tile, in my possibly quite different tests). If the basic omp reduction operations (simd or parallel) don't apply to your algorithm, you may have to write out the tree reduction yourself.

The simple critical-section approach may speed up a reduction, but it is still likely to show linear time behavior.

OpenMP 4 includes C array reduction. I've seen it work effectively for 1 or 2 threads, but otherwise the atomic is likely to be better.
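
For reference, a minimal sketch of that array-reduction form (note: array-section reductions in C/C++ were standardized in OpenMP 4.5, so support under an OpenMP 4.0 compiler may vary; the function and variable names here are illustrative, not from the original post):

#include <omp.h>
#include <vector>

// Histogram of values in [0, 1] into n bins using an OpenMP
// array-section reduction: each thread gets a private copy of
// hist[0:n] and the runtime combines the copies at the end,
// typically as a tree.
void histogram(const std::vector<double>& x, int n, int* hist) {
  for (int k = 0; k < n; ++k) hist[k] = 0;
#pragma omp parallel for reduction(+ : hist[0:n])
  for (long i = 0; i < (long)x.size(); ++i) {
    int k = (int)(x[i] * n);
    if (k == n) k = n - 1;  // a value of exactly 1.0 goes to the last bin
    ++hist[k];
  }
}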

jimdempseyatthecove
Honored Contributor III

Can you post your code?
A lot of the performance difference will depend on how well vectorization is used.
Which Xeon Phi are you using?
Is this an offload or native application?

Jim Dempsey

SergeyKostrov
Valued Contributor II
>>I would like to parallelize the computation of a histogram.
>>- Let's suppose that you have a std::vector of elements in between 0 and 1 (think of a size of 1 billion)

I'd like to share my experience with that problem. Some time ago I did an analysis of whether a histogram algorithm needs to be parallelized at all. You know that:

1. Its asymptotic complexity is O(N), and the algorithm builds the histogram in one processing pass (!) over the data.
2. It is a memory-bound algorithm, and no floating-point operations are involved.

My data sets were even bigger than 1 billion elements, that is, 4 Giga Elements ( GE ) and 8 GE. As soon as OpenMP threads ( let's say 4 hardware threads ) started "fighting" for a histogram bin, and you know that a critical section needs to be used in order to update a value in a bin correctly (!), processing performance went down significantly.

I came to the conclusion that the number of memory channels of a CPU needs to be taken into account: no more than two hardware threads on a CPU with two memory channels, no more than four hardware threads on a CPU with four memory channels, and so on. Once again, this is because the algorithm is memory bound.

However, because in my case the single-threaded version was so efficient, it was decided that in production code for a Pigeonhole Sorting algorithm, for data sets of integers, no more than one thread would be used. Take into account that if a data set fits into RAM completely ( not RAM plus virtual memory! ), a single-threaded version of the algorithm could satisfy all your performance requirements.

Here are 2-year-old performance results for the Pigeonhole Sorting algorithm:

1. A data set of 4 GE was sorted in ~8.9 seconds.
2. A data set of 8 GE was sorted in ~219.4 seconds ( note: there was some negative performance impact from using virtual memory during processing ).

In the 8 GE test, that was 18x faster than a recursive Merge Sorting algorithm on an identical (!) data set. Of course, you can still try to parallelize the histogram algorithm for R&D or academic purposes.
SergeyKostrov
Valued Contributor II
A couple more notes:

>>...Let's suppose that you have a std::vector of elements in between 0 and 1...

- What kind of distribution ( of values between 0 and 1 ) are you going to have?
- In order to build a histogram you need to know the total number of bins in advance. With integers that is a much easier task, but you are going to have floating-point values ( based on your initial requirements ), like [ 0.00000, 0.12345, 0.23456, 0.34567, 0.45678, ... ] and so on.
- Declaring the std::vector with a single-precision data type ( float ) doesn't make sense to me, because it is not clear how many histogram bins need to be allocated before processing ( ...if you're going to add bins dynamically, by growing the array of bins, I can tell you this is not going to work, in terms of performance, once the number of elements exceeds 1 GE... ).
- Theoretically, it is possible that all elements in a source data set are different; in that case the histogram is as large as the source data set and every bin holds the value 1 ( ...looks simple, doesn't it?.. ).
- Take into account that histogram algorithms are very sensitive to processing overheads, which is why my implementations use no STL code at all ( only raw C arrays and raw C indexing ).
McCalpinJohn
Honored Contributor III

Isn't this what the OpenMP ATOMIC construct is for?

The fundamental operation of a histogram is an atomic increment, which can be implemented in hardware fairly efficiently. 

The OpenMP 4.0 standard, section 2.12.6, describes the atomic construct. From the examples, it looks like you want all of the threads to update the histogram array using something like:

// Distribute array locations across threads
#pragma omp parallel for
for (i=0; i<ARRAY_SIZE; i++) {
    // Every thread processes all bins
    for (k=0; k<n; k++) {
        if (array[i] is in the kth range) {
#pragma omp atomic update
            histogram[k]++;
        }
    }
}

This approach ought to be able to exploit the hardware's ability to do atomic updates, which should be much more efficient than a critical section.
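
For concreteness, a compilable sketch of this approach that computes the bin index directly instead of testing every bin (the function and variable names are illustrative):

#include <omp.h>

// Atomic-update histogram: each thread computes the target bin
// directly and increments the shared counter with an atomic update.
void histogram_atomic(const float* x, long nb_point, int* hist, int n) {
  for (int k = 0; k < n; ++k) hist[k] = 0;
#pragma omp parallel for
  for (long i = 0; i < nb_point; ++i) {
    int k = (int)(x[i] * n);
    if (k == n) k = n - 1;  // map a value of exactly 1.0f into the last bin
#pragma omp atomic update
    hist[k]++;
  }
}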

velvia
Beginner

Hi,

The code is meant to be used on Xeon and Xeon Phi (KNL). It is mainly for "academic" purposes, as it is for a code modernization workshop. I understand that this code is memory bound, but it could be useful to launch many threads on a NUMA system, and a Xeon Phi KNL in quadrant mode should be considered as a quad-socket system.

Here is the code of the algorithm.

#include <chrono>
#include <cstdio>
#include <random>

#include <omp.h>

int main() {
  int nb_point{1000000000};
  int nb_cluster{1000};
  int nb_thread{omp_get_max_threads()};

  float* x{new float[nb_point]};
  int* index{new int[nb_point]};

  float* sum{new float[nb_cluster]};
  float* local_sum{new float[nb_cluster * nb_thread]};

#pragma omp parallel
  {
    // One engine per thread: a std::default_random_engine is not
    // safe to share across threads
    std::default_random_engine generator{
        static_cast<unsigned>(omp_get_thread_num())};
    std::uniform_real_distribution<float> real_distribution{0.0f, 1.0f};
#pragma omp for
    for (int i{0}; i < nb_point; ++i) {
      x[i] = real_distribution(generator);
      index[i] = i % nb_cluster;
    }
  }
  for (int j{0}; j < nb_cluster; ++j) {
    sum[j] = 0.0f;
  }
  for (int j{0}; j < nb_cluster * nb_thread; ++j) {
    local_sum[j] = 0.0f;
  }

  auto begin = std::chrono::high_resolution_clock::now();
// Reduction: each thread accumulates into its own block of local_sum
#pragma omp parallel for
  for (int i{0}; i < nb_point; ++i) {
    int id{omp_get_thread_num()};
    int j{index[i]};
    local_sum[id * nb_cluster + j] += x[i];
  }

  // Finalizing the reduction: linear in the number of threads
  for (int id{0}; id < nb_thread; ++id) {
    for (int j{0}; j < nb_cluster; ++j) {
      sum[j] += local_sum[id * nb_cluster + j];
    }
  }
  auto end = std::chrono::high_resolution_clock::now();
  double time{1.0e-9 *
              std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin)
                  .count()};

  std::printf("Time total: %7.3f\n", time);

  delete[] local_sum;
  delete[] sum;
  delete[] index;
  delete[] x;

  return 0;
}

 

Gregg_S_Intel
Employee

Minor clarification: a KNL in sub-NUMA clustering mode is configured as either 2 or 4 NUMA nodes. Quadrant mode is a uniform memory space.

I confess I don't see an advantage of this code for NUMA.
jimdempseyatthecove
Honored Contributor III

I see your code builds the random distribution in x[], but I do not see code inside the timed region that builds the histogram.

Why isn't this part of the timed region?

Note, the accumulation statement local_sum[id * nb_cluster + j] += x[i]; (post #7) is not producing a histogram. The equivalent would be:

for (int i{0}; i < nb_point; ++i) {
  sum[(int)(x[i] * (double)nb_cluster)]++;
}
// Note, sum needs to be allocated with nb_cluster+1 elements
// because the random range was defined as inclusive of 0.0 and 1.0

Jim Dempsey

Francois_F_
Beginner

Hi Jim,

I agree that this is not a histogram; instead of adding one to the "bin", we add the value. It looked close enough to the histogram algorithm that I simplified things. Here, the part where we finalize the reduction is linear in the number of threads. I want to make it logarithmic in the number of threads.
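
A sketch of one possible log-depth finalization, as a pairwise tree reduction over the per-thread blocks (it reuses the local_sum/nb_cluster/nb_thread names from the code in post #7; illustrative, not a tuned implementation):

// Pairwise tree reduction of the per-thread partial sums: in each
// round, blocks at distance `stride` are added together, halving
// the number of live blocks, so there are log2(nb_thread) rounds.
for (int stride{1}; stride < nb_thread; stride *= 2) {
#pragma omp parallel for
  for (int id{0}; id < nb_thread - stride; id += 2 * stride) {
    for (int j{0}; j < nb_cluster; ++j) {
      local_sum[id * nb_cluster + j] +=
          local_sum[(id + stride) * nb_cluster + j];
    }
  }
}
// Block 0 now holds the complete sums.
for (int j{0}; j < nb_cluster; ++j) {
  sum[j] = local_sum[j];
}

Each round is itself a parallel loop, so the serial depth is logarithmic even though the total work stays linear in nb_thread * nb_cluster.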

jimdempseyatthecove
Honored Contributor III

On my KNL system, Time total: 0.073 seconds.

Hardly enough time to be worth optimizing.

You should have used:

index[i] = (int)(x[i] * nb_cluster); // you may want to adjust rounding here

I will leave it to you as an implementation detail whether the number of indexes should be nb_cluster or nb_cluster+1.
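
One common way to settle that detail, if values can land exactly on 1.0f, is to clamp the computed index rather than allocate an extra bin (a sketch using the names from the posted code):

// Clamp the bin index so a value of exactly 1.0f maps into the last
// bin, keeping the number of indexes at nb_cluster.
for (int i{0}; i < nb_point; ++i) {
  int k{(int)(x[i] * nb_cluster)};
  if (k == nb_cluster) k = nb_cluster - 1;
  index[i] = k;
}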

Jim Dempsey

 

SergeyKostrov
Valued Contributor II
>>// Distribute array locations across threads
>>#pragma omp parallel for
>>for (i=0; i<ARRAY_SIZE; i++) {
>>    // Every thread processes all bins
>>    for (k=0; k<n; k++) {
>>        if (array[i] is in the kth range) {
>>#pragma omp atomic update
>>            histogram[k]++;
>>        }
>>    }
>>}

John, if I had not investigated #pragma omp atomic I would keep silent. However, I have tried different versions, including #pragma omp critical, and none of them satisfied the performance objectives. I will need to take a look at the #pragma omp atomic update clause.