Hi All,
Is it possible to count unique values in a given array with AVX2 (or AVX) intrinsics?
For example,
For the given input data and count arrays,
data = [0, 4, 3, 2, 3, 2, 4, 4 ]; count = [0, 0, 0, 0, 0];
How can I get the following result (i.e., count[data[ii]]++) with AVX2 intrinsics?
data = [0, 4, 3, 2, 3, 2, 4, 4]; count = [1, 0, 2, 2, 3]; // for 0, 1, 2, 3, 4
Note:
To make my question as simple as possible, I use a very small data set.
However, in general, the data can be any size and any range (e.g., [-10, 20, 123457, -2346, 2456, ...]).
FYI,
here is my current test code.
alignas(32) int dary[8] { 0, 4, 3, 2, 3, 2, 4, 4 }; // input values
alignas(32) int cary[5] { 0, 0, 0, 0, 0 }; // counting array
__m256i vidx = _mm256_load_si256(reinterpret_cast<const __m256i *>(dary)); // 0, 4, 3, 2, 3, 2, 4, 4
__m256i cnts = _mm256_i32gather_epi32(cary, vidx, 4); // 0, 0, 0, 0, 0, 0, 0, 0
cnts = _mm256_add_epi32(cnts, _mm256_set1_epi32(1)); // 1, 1, 1, 1, 1, 1, 1, 1
// Now, I am stuck here !!??
Any help or suggestion will be a great help for me.
Many thanks,
Sunwoo
- Tags:
- Intel® Advanced Vector Extensions (Intel® AVX)
- Intel® Streaming SIMD Extensions
- Parallel Computing
Though somewhat off topic: did you consider sorting the sequence and then counting the duplicates in the sorted sequence?
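For illustration only (not from the thread), a minimal scalar sketch of that sort-then-count idea, returning (value, count) pairs, might look like this:

#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hedged sketch of the sort-then-count suggestion above: sort a copy of the
// input, then count each run of equal values in one pass.
std::vector<std::pair<int, int>> count_by_sorting(std::vector<int> v)
{
    std::sort(v.begin(), v.end());
    std::vector<std::pair<int, int>> result; // (value, count) pairs
    for (std::size_t i = 0; i < v.size(); ) {
        std::size_t j = i;
        while (j < v.size() && v[j] == v[i]) ++j; // advance past the run of v[i]
        result.emplace_back(v[i], static_cast<int>(j - i));
        i = j;
    }
    return result;
}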
My apologies if this isn't the right forum for the question. It's for implementing an efficient counting (index) sort algorithm, so I cannot sort the list before counting the values of the input array. I was wondering whether or not there are AVX2 intrinsic functions or instructions (e.g., gather, reduction, ...) to implement count[data[ii]]++.
Histogram computation is difficult to vectorize because of the possibility of requiring multiple updates to the same histogram bin in a single instruction. The AVX-512 instruction set includes new instructions for "conflict detection" that can improve the performance of this computation. See https://software.intel.com/en-us/articles/improve-vectorization-performance-using-intel-advanced-vector-extensions-512
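For reference, a minimal sketch of that idea (an illustration, not code from the linked article): check each vector of 16 indices with the conflict-detection instruction and do the gather/add/scatter only when no index repeats within the vector, falling back to scalar updates otherwise. The article's full scheme resolves conflicts within the vector instead of falling back.

#include <immintrin.h>
#include <cstddef>

// Hedged sketch: requires AVX-512F + AVX-512CD. Assumes every value in
// 'data' is a valid index into 'count'.
void histogram_avx512cd(const int *data, std::size_t n, int *count)
{
    const __m512i one = _mm512_set1_epi32(1);
    std::size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m512i vidx  = _mm512_loadu_si512(&data[i]);
        __m512i vconf = _mm512_conflict_epi32(vidx); // per-lane bits marking earlier equal lanes
        if (_mm512_test_epi32_mask(vconf, vconf) == 0) {
            // no duplicate indices inside this vector: gather, add 1, scatter
            __m512i vcnt = _mm512_i32gather_epi32(vidx, count, 4);
            vcnt = _mm512_add_epi32(vcnt, one);
            _mm512_i32scatter_epi32(count, vidx, vcnt, 4);
        } else {
            // duplicates within the vector: scalar fallback for these 16 elements
            for (int k = 0; k < 16; ++k)
                ++count[data[i + k]];
        }
    }
    for (; i < n; ++i) // remainder
        ++count[data[i]];
}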
A problem you have with scatter/gather is that the counter array must span the lowest counted value to the highest counted value inclusive. IOW, the indices used for scatter/gather must be valid indexes into the histogram array. The conflict issue that John raises is a real one.
Any CPU that supports scatter/gather is also multi/many-core. It might be better to use scalar instructions in a parallel region (all threads iterating the full range of data), but where each thread counts only a subset of the values in data. IOW, all threads reading the same data gives a high probability of L1/L2/L3 hits for the reads, and each thread having exclusive use of a subsection of the counts avoids race conditions. This would be effective for large data sets (large enough to amortize the parallel startup overhead).
Jim Dempsey
There are lots of tradeoffs here depending on the size of the input array, the size of the histogram array, the cache sizes, the memory sizes, the model(s) for parallelization, the performance requirements, etc.... A search on "HPCC RandomAccess benchmark optimization" should point to some of these issues for optimization of this algorithm for very large memory sizes on large clusters.... (NOTE: The HPCC RandomAccess benchmark actually allows you to get the wrong answer for some elements -- this makes vectorization easier, but also means that some of the implementations are not useful if you need the (exact) correct answer.)
An additional hint on the second paragraph of post #5:
When the data values are clumped into zones, splitting the value range into uniform pieces could result in only a few of your threads performing counter increments.
if (data[i] >= lowRange && data[i] < highRange) ++count[data[i]]; // some threads may get starved
The following might be better:
const int ncountInCacheLine = 64 / sizeof(count[0]); // number of count[] elements that fit in a cache line
#pragma omp parallel // NOT parallel for
{
    int nThreads = omp_get_num_threads();
    int iThread  = omp_get_thread_num();
    for (int i = 0; i < nData; ++i) // all threads iterate the full range
    {
        // select for ++ only the values whose counter cache line belongs to this thread
        if ((data[i] / ncountInCacheLine) % nThreads == iThread)
            ++count[data[i]];
    }
}
You will have to test on representative data
Jim Dempsey
Hi John and Jim,
It seems almost impossible to build a SIMD version of a unique-element counter / histogram with the current AVX2 ISA and intrinsics.
BTW, AVX-512 CD (or something more generic) is what I expected from AVX2 :-(
A multi-threaded version without SIMD support is another good idea.
However, the sequential version seems (much) faster than the multi-threaded version unless the array is very large (e.g., on the order of a billion elements) on AVX2 CPUs.
Thanks,
Sunwoo
It should be easy to create a parallel version that does round-robin processing (by cache line) on the input vector (to help balance the load) and atomic updates to the histogram vector (to maintain correctness in the event of collisions). The modulo calculations should be moved out of the inner loop -- each thread will have its own start index and end index, and each thread will stride by OMP_NUM_THREADS cache lines.
The atomic updates could be done in a variety of ways -- the easiest is probably to load the target histogram element, increment the value, perform a locked compare and exchange, check to see if there was a collision, and (if there was a collision) repeat the (increment + compare and exchange) until it succeeds. If the locked compare and
The update could also be done in a TSX region (if your processor supports it), but that is probably not necessary for such a simple case.
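As an illustration only (not John's code), a sketch of the round-robin partitioning plus compare-and-exchange updates might look like the following; the function name and the nThreads/iThread parameters are assumptions:

#include <algorithm>
#include <atomic>
#include <cstddef>

// Hedged sketch: each thread walks the input in round-robin cache-line-sized
// chunks and increments shared bins with a compare-and-exchange retry loop.
void histogram_cas(const int *data, std::size_t n,
                   std::atomic<int> *count,
                   int nThreads, int iThread)
{
    const std::size_t lineWords = 64 / sizeof(int); // input ints per cache line
    for (std::size_t base = std::size_t(iThread) * lineWords; base < n;
         base += std::size_t(nThreads) * lineWords)  // stride by nThreads cache lines
    {
        const std::size_t end = std::min(base + lineWords, n);
        for (std::size_t i = base; i < end; ++i) {
            std::atomic<int> &bin = count[data[i]];
            int old = bin.load(std::memory_order_relaxed);
            // retry until no other thread raced us on this bin
            // (compare_exchange_weak refreshes 'old' on failure)
            while (!bin.compare_exchange_weak(old, old + 1,
                                              std::memory_order_relaxed))
                ;
        }
    }
}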
Better avoid contention and atomic operations and maintain per-thread counters, which can then be accumulated together after all threads complete counting duplicates.
Better avoid contention and atomic operations and maintain per-thread counters, which can then be accumulated together after all threads complete counting duplicates.
This would certainly be a win if the histogram array is small enough for the replicated private copies to have high hit rates in the private caches, and probably a win if the histogram array is small enough for the replicated private copies to have high hit rates in shared cache. On the other hand, if a single (global) histogram array fits into cache, but replicated histogram arrays overflow the caches, then the performance benefit is not quite so clear.
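For illustration, a minimal OpenMP sketch of the per-thread-counter approach being discussed (the nbins parameter is an assumption and must cover the full value range):

#include <cstddef>
#include <vector>

// Hedged sketch: each thread fills a private histogram, then the private
// copies are summed into the shared result.
void histogram_private(const int *data, std::size_t n, int *count, int nbins)
{
    #pragma omp parallel
    {
        std::vector<int> local(nbins, 0); // per-thread histogram
        #pragma omp for nowait
        for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(n); ++i)
            ++local[data[i]];

        // accumulate the private copies into the shared histogram
        for (int b = 0; b < nbins; ++b) {
            #pragma omp atomic
            count[b] += local[b];
        }
    }
}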
The best algorithm will not be generic. The best algorithm will be determined by:
a) Number of items being counted
b) Number of counters (number of unique items)
c) Nature of items (dense identifiers, sparse identifiers, random identifiers, clumpy identifiers, bell curve identifiers, ...)
John>>The update could also be done in a TSX region...
Do you have some experience with the approximate number of cache lines that can be tracked per HW thread (which may differ for reads and writes)? This could potentially be a nice way to effectively expand the register set. By this I mean: does a "memory" read/modify/write within a TSX region work faster than a read/modify/write to a memory location held in L1 cache? I suspect it does, particularly in the event of multiple r/m/w to the same cache line while holding the TSX region. Of course, you would have to be careful not to perform too much work in the event that the O/S or some interrupt causes the transaction to abort (and the calculation has to be redone).
Jim Dempsey
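For illustration only (not from the thread), a minimal RTM sketch of such an update with an abort fallback might look like this; the atomic-increment fallback is an assumption, chosen to keep the non-transactional path safe against concurrent writers:

#include <immintrin.h>

// Hedged sketch: requires RTM support (compile with -mrtm). If the
// transaction aborts (conflict, capacity, interrupt), the update is redone
// with an atomic increment so concurrent writers stay correct.
inline void tsx_bin_increment(int *count, int idx)
{
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        ++count[idx];   // transactional read/modify/write
        _xend();
    } else {
        __atomic_fetch_add(&count[idx], 1, __ATOMIC_RELAXED); // fallback path
    }
}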