counting unique elements with AVX2 intrinsic functions

sunwoo · ‎07-04-2019

Hi All,

Is it possible to count unique values in a given array with AVX2 (or AVX) intrinsics?

For example,

For the given input data and count arrays,

data = [0, 4, 3, 2, 3, 2, 4, 4 ]; count = [0, 0, 0, 0, 0];

How can I get the following result (i.e., count[data[ii]]++) with AVX2 intrinsics ?

data = [0, 4, 3, 2, 3, 2, 4, 4]; count = [1, 0, 2, 2, 3]; // for 0, 1, 2, 3, 4

Note:

To make my question as simple as possible, I use a very small data.

However, in general, it can be any size and any range (e.g., [ -10, 20, 123457, -2346, 2456, ...])

FYI,

here is my current test code.

alignas(32) int dary[8] { 0, 4, 3, 2, 3, 2, 4, 4 }; // input values

alignas(32) int cary [5] { 0, 0, 0, 0, 0}; // counting array

__m256i vidx = _mm256_load_si256(reinterpret_cast<const __m256i *>(&dary); // 0, 4, 3, 2, 3, 2, 4, 4

__m256i cnts = _mm256_i32gather_epi32(cary, vidx, 4); // 0, 0, 0, 0, 0, 0, 0, 0

cnts =_mm256_add_epi32 (cnts, _mm256_set1_epi32(1)); // 1, 1, 1, 1, 1, 1, 1

// Now, I am stuck here !!??

Any help or suggestion will be a great help for me.

Many thanks,

Sunwoo

andysem · ‎07-05-2019

Though somewhat offtopic, but did you consider sorting the sequence and then counting the duplicates in the sorted sequence?

sunwoo · ‎07-06-2019

My apology if it's the right forum for the question. It's for implementing efficient counting index sort algorithm. So, I cannot sort the list before counting values of the input array. I was wondering whether or not there are AVX2 intrinsic functions or Instruction sets (e.g., gather, reduction, ...) to implement "count[data[ii]]++) "

McCalpinJohn · ‎07-07-2019

Histogram computation is difficult to vectorize because of the possibility of requiring multiple updates to the same histogram bin in a single instruction. The AVX-512 instruction set includes new instructions for "conflict detection" that can improve the performance of this computation. See https://software.intel.com/en-us/articles/improve-vectorization-performance-using-intel-advanced-vector-extensions-512

jimdempseyatthecove · ‎07-08-2019

A problem you have with scatter/gather is the number of counters must span Lowest Counted Number to Highest Counted Number inclusive. IOW the indices for scatter/gather must be valid indexes into the histogram array. The conflict issue that John raises is a real issue.

Any CPU that supports scatter/gather is also multi/many core. It might be better to use scalar instructions in a parallel region (all threads iterating full range of data), but where each thread counts only a subset of the values in data. IOW all threads reading the same data places a high probability of L1/L2/L3 hits for reads, and each thread having exclusive use of a subsection of counts avoids race conditions. This would be effective for large data (such as to amortize parallel startup overhead).

Jim Dempsey

McCalpinJohn · ‎07-09-2019

There are lots of tradeoffs here depending on the size of the input array, the size of the histogram array, the cache sizes, the memory sizes, the model(s) for parallelization, the performance requirements, etc.... A search on "HPCC RandomAccess benchmark optimization" should point to some of these issues for optimization of this algorithm for very large memory sizes on large clusters.... (NOTE: The HPCC RandomAccess benchmark actually allows you to get the wrong answer for some elements -- this makes vectorization easier, but also means that some of the implementations are not useful if you need the (exact) correct answer.)

jimdempseyatthecove · ‎07-10-2019

Additional hint on #5 second paragraph

When the data values are clumped into zones, splitting the value range into uniform pieces could result in only a few of your threads performing counter increment.

if(data >= lowRange && data < highRange) ++ count[data); // some threads may get starved

The following might be better

const int ncountInCacheLine = 64 / sizeof(count[0]); // number of data elements that fit in cache line
#pragma omp parallel // NOT parallel for
{
   int nThreads = omp_get_num_threads();
   int iTHread = omp_get_thread_num();
   for(int i=0; i<nData; ++i) // all threads iterate full range
   {
       // select for ++ these values
       if((data/ncountInCacheLine) % nThreads == iThread) ++count[data];
   }
}

You will have to test on representative data

Jim Dempsey

sunwoo · ‎07-10-2019

Hi John and Jim,

It seems almost impossible to build the SIMD version of unique array element counter and histogram based on current AVX2 ISA/intrinsic.

BTW, AVX512-CD (or, more generic one) is what I expected from AVX2 :-(

A multi-thread version without SIMD support is another good idea.

However, the sequential version seems (much) faster than the multi-thread version unless a given array size is so big such as 1 billions on AVX2 CPUs.

Thanks,

Sunwoo

McCalpinJohn · ‎07-11-2019

It should be easy to create a parallel version that does round-robin processing (by cache line) on the input vector (to help balance the load) and atomic updates to the histogram vector (to maintain correctness in the event of collisions). The modulo calculations should be moved out of the inner loop -- each thread will have its own start index and end index, and each thread will stride by OMP_NUM_THREADS cache lines.

The atomic updates could be done in a variety of ways -- the easiest is probably to load the target histogram element, increment the value, perform a locked compare and exchange, check to see if there was a collision, and (if there was a collision) repeat the (increment + compare and exchange) until it succeeds. If the locked compare and

The update could also be done in a TSX region (if your processor supports it), but that is probably not necessary for such a simple case.

andysem · ‎07-12-2019

Better avoid contention and atomic operations and maintain per-thread counters, which can then be accumulated together after all threads complete counting duplicates.

McCalpinJohn · ‎07-12-2019

Better avoid contention and atomic operations and maintain per-thread counters, which can then be accumulated together after all threads complete counting duplicates.

This would certainly be a win if the histogram array is small enough for the replicated private copies to have high hit rates in the private caches, and probably a win if the histogram array is small enough for the replicated private copies to have high hit rates in shared cache. On the other hand, if a single (global) histogram array fits into cache, but replicated histogram arrays overflow the caches, then the performance benefit is not quite so clear.

jimdempseyatthecove · ‎07-13-2019

The best algorithm will not be generic. The best algorithm will be determined by:

a) Number of items being counted
b) Number of counters (number of unique items)
c) Nature of items (dense identifiers, sparse identifiers, random identifiers, clumpy identifiers, bell curve identifiers, ...)

John>>The update could also be done in a TSX region...

Do you have some experience on the approximate number of cache lines that can be tracked per HW thread (which may differ for reads and writes)? This could potentially be a nice way to effectively expand the register set. By this I mean does a "memory" read/modify/write within a TSX region work faster than read/modify/write to memory location held in L1 cache? I suspect it does, particularly in the event of multiple r/m/w to the same cache line while holding the TSX region. Of course you would have to be careful not to perform too much work in the event that the O/S or some interrupt cause the transaction to abort (then re-do the calculation again).

Jim Dempsey