Intel® ISA Extensions

Optimizing XOR and Popcount using AVX2


We need to perform a bitwise XOR on two arrays, each containing 5 elements of uint64_t (unsigned long long), and then count the 1 bits (popcount) in the result.

What is the optimal way to do this in the minimum number of clock cycles using AVX2 (256-bit YMM registers, VPXOR) and popcount?

Right now we are doing this with the following code snippet:

for (j = 0; j < 5; j++) {
    xorResult = cylinderArrayVectorA[j] ^ cylinderArrayVectorB[j];
    noOfOnes = _mm_popcnt_u64(xorResult);
    sumOfOnes += noOfOnes;
}


Black Belt

Are you looking to minimize latency to compute this for a single pair of 5-element vectors, or to maximize throughput for performing this operation on independent pairs of 5-element vectors?  

The minimum-latency code may be the scalar code above.  The two loads can be issued in one cycle, with one of the loads used as a memory argument to the XOR (which is effectively free, since 4 ports are available to execute XOR with single-cycle latency); the popcount instruction has 3-cycle latency and can only be executed on one functional unit; and the add is effectively free (4 ports are available to execute ADD with single-cycle latency).  Minimum latency will be L1 data cache load latency (typically 4 cycles) plus about 16 cycles on the critical path.  There are only about 20 instructions in the fully unrolled loop (load, xor, popcnt, add), so the out-of-order engine will have lots of room to issue and execute the surrounding code concurrently with this loop.

AVX2 *might* help if you need to perform this operation on lots of independent pairs of 5-element vectors, but it might not.  There is no vector population count instruction, so you would need to emulate it using bit tricks; examples of such emulations have been published.  It is important to note that instructions operating on 256-bit AVX2 registers that allow data to cross between the upper and lower 128-bit "lanes" have 3-cycle latency, while instructions that only allow "horizontal" data motion within 128-bit "lanes" have single-cycle latency.  If I had to guess, I suspect that the best implementation would use two 128-bit registers for four of the elements and a GPR for the 5th element, but I have never analyzed a vectorized SW implementation of popcount....
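For illustration (this code is not from the thread), one well-known bit-trick emulation of a per-lane vector popcount uses a 4-bit lookup table with VPSHUFB, which shuffles only within 128-bit lanes and so avoids the 3-cycle cross-lane penalty mentioned above.  A minimal sketch, with illustrative names:

```c
#include <immintrin.h>
#include <stdint.h>

/* Sketch: population count of each 64-bit lane of a 256-bit register,
   using the classic 4-bit lookup-table / PSHUFB technique.
   VPSHUFB and VPSADBW operate within 128-bit lanes, so no cross-lane
   (3-cycle-latency) instructions are needed. */
__attribute__((target("avx2")))
static __m256i popcount256_epi64(__m256i v)
{
    /* popcount of every 4-bit value 0..15, repeated for both lanes */
    const __m256i lut = _mm256_setr_epi8(
        0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,
        0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4);
    const __m256i mask = _mm256_set1_epi8(0x0F);
    __m256i lo  = _mm256_and_si256(v, mask);                       /* low nibbles  */
    __m256i hi  = _mm256_and_si256(_mm256_srli_epi16(v, 4), mask); /* high nibbles */
    __m256i cnt = _mm256_add_epi8(_mm256_shuffle_epi8(lut, lo),
                                  _mm256_shuffle_epi8(lut, hi));   /* per-byte counts */
    /* VPSADBW against zero sums the 8 byte counts within each 64-bit lane */
    return _mm256_sad_epu8(cnt, _mm256_setzero_si256());
}

/* Small driver so callers without AVX2 compile flags can test it. */
__attribute__((target("avx2")))
static void popcount256_demo(const uint64_t in[4], uint64_t out[4])
{
    __m256i v = _mm256_loadu_si256((const __m256i *)in);
    _mm256_storeu_si256((__m256i *)out, popcount256_epi64(v));
}
```

This stays entirely within 128-bit lanes, so every instruction on the critical path has low latency; the cost is the extra shuffle/mask work compared with the scalar POPCNT instruction.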

Black Belt

Silly me -- the popcount instruction is limited to a single unit, but the 5 popcounts are independent, so they don't need to execute sequentially.  The five popcounts can execute in a pipelined fashion, and the summation can be done in a tree to provide additional overlap.   A simple implementation compiled as a separate function required 25 instructions (including passing pointers to the two arrays), and 1000 executions took an average time of about 9.6 cycles per call.
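The pipelined-popcount / tree-summation pattern described above might be sketched as follows (illustrative names; `__builtin_popcountll` compiles to the POPCNT instruction on GCC/Clang when popcount support is enabled):

```c
#include <stdint.h>

/* Fully unrolled XOR + popcount over two 5-element uint64_t arrays.
   The five popcounts are independent, so they can flow through the
   single (pipelined) popcount unit back to back; summing them as a
   small tree rather than a serial chain shortens the add dependency
   path from four sequential adds to a depth of three. */
static inline uint64_t xor_popcount5(const uint64_t *a, const uint64_t *b)
{
    uint64_t c0 = (uint64_t)__builtin_popcountll(a[0] ^ b[0]);
    uint64_t c1 = (uint64_t)__builtin_popcountll(a[1] ^ b[1]);
    uint64_t c2 = (uint64_t)__builtin_popcountll(a[2] ^ b[2]);
    uint64_t c3 = (uint64_t)__builtin_popcountll(a[3] ^ b[3]);
    uint64_t c4 = (uint64_t)__builtin_popcountll(a[4] ^ b[4]);
    return ((c0 + c1) + (c2 + c3)) + c4;   /* tree reduction */
}
```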

Black Belt

It would help greatly to see a wider picture of what is going on.

You should be aware that making the innermost loop run as efficiently as possible can be counterproductive to performance in the outer scopes where it is used.

The above comment is in reference to John's question about whether you have:

a single pair of 320-bit (5x64-bit) vectors, or
multiple pairs of 320-bit vectors.

Add to this mix a third case: a single 320-bit vector compared against multiple 320-bit vectors.

When the mix is large enough, it may be more effective to transpose the data: instead of storing each 320-bit vector as 5 horizontally adjacent uint64_t values, spread 4 different 320-bit vectors across the four 64-bit lanes of five SIMD registers. Each AVX2 operation then works on the corresponding element of 4 independent vector pairs at once.

A problem with the scalar code above is the scalar addition of the sums. If you have many such pairs of 320-bit vectors, the loop you provided (presumably operating on uint64_t variables) could be re-used to operate on 4 such pairs concurrently across the width of the AVX/AVX2 SIMD registers.
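The transposed (structure-of-arrays) layout might look like this sketch, written in portable scalar C with the loops standing in for the SIMD lanes; the array names and shapes are illustrative, not from the thread:

```c
#include <stdint.h>

/* Transposed layout: a[e][p] holds element e of pair p, so each row
   of 4 values maps onto one 256-bit YMM register (4 x 64-bit lanes).
   One SIMD XOR would then process element e of all 4 pairs at once,
   and the 4 running sums stay in separate lanes until the end,
   avoiding per-element scalar additions. */
static void xor_popcount5_x4(const uint64_t a[5][4], const uint64_t b[5][4],
                             uint64_t sums[4])
{
    for (int p = 0; p < 4; p++)
        sums[p] = 0;
    for (int e = 0; e < 5; e++)       /* 5 "register" rows           */
        for (int p = 0; p < 4; p++)   /* 4 lanes, i.e. 4 vector pairs */
            sums[p] += (uint64_t)__builtin_popcountll(a[e][p] ^ b[e][p]);
}
```

A vectorized version would replace the inner loop with one VPXOR plus an emulated vector popcount per row, adding the per-lane counts into a YMM accumulator and reading out the four sums only once at the end.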

Jim Dempsey