Intel® C++ Compiler
Support and discussions for creating C++ code that runs on platforms based on Intel® processors.

How to sum up a large array of 32-bit integers in a fast way?


I want to sum up a large array of 32-bit integers, e.g. 10M integers. How can I do this in a very fast way?

I am now trying to use Intel SSE Intrinsics to do this.

Each time, I load 4 integers with one instruction and then sum them up, so I only need to loop N/4 times, where N is the array size.

The steps are as follows:
Loop N/4 times
1, load 4 32-bit int, a0 a1 a2 a3
2, pack the data as:
xmm1 = a0 0 a1 0
xmm2 = a2 0 a3 0
3, sum up
xmm3 = a0+a2 0 a1+a3 0
Loop End

The problem is: summing two 32-bit integers may result in overflow.

So how can I add two 32-bit integers and save the result in a 64-bit integer using SSE intrinsic functions?

Thank you very much.

5 Replies
Black Belt
Strictly speaking, of course, there are no original SSE instructions for 64-bit ints, and you can handle only pairs of 64-bit data under SSE2. In SSE2 you have _mm_add_epi64, so you could extend the ints to 64-bit and pack them into an __m128i, then add, and at the end add the 2 64-bit ints.
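A minimal SSE2-only sketch of the approach just described (the function name `sum_i32_sse2` is made up here, and it assumes `n` is a multiple of 4): the sign bit of each lane is replicated with an arithmetic shift, then interleaved with the values to form sign-extended 64-bit integers for `_mm_add_epi64`.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <stddef.h>

/* Sum n 32-bit ints into a 64-bit total without overflow (SSE2 only).
   Assumes n is a multiple of 4; a scalar tail loop would handle the rest. */
int64_t sum_i32_sse2(const int32_t *a, size_t n)
{
    __m128i acc = _mm_setzero_si128();            /* two 64-bit partial sums */
    for (size_t i = 0; i < n; i += 4) {
        __m128i v    = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i sign = _mm_srai_epi32(v, 31);     /* 0 or -1 per lane */
        /* interleaving value and sign yields sign-extended 64-bit ints */
        acc = _mm_add_epi64(acc, _mm_unpacklo_epi32(v, sign));
        acc = _mm_add_epi64(acc, _mm_unpackhi_epi32(v, sign));
    }
    int64_t lanes[2];                             /* combine the two lanes */
    _mm_storeu_si128((__m128i *)lanes, acc);
    return lanes[0] + lanes[1];
}
```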
You will have to wait for AVX to see 4-wide parallel 64-bit adds. Then, if you trust Intel C, you get them from plain C source code:
long long int extsum(int *a, int n){
long long int sum = 0;
for(int i = 0; i < n; ++i) sum += a[i];
return sum;
}

vpxor xmm0, xmm0, xmm0
add eax, edx
xor ecx, ecx

.B1.4: ; Preds .B1.4 .B
vpmovsxdq xmm1, QWORD PTR [esi+ecx*4]
vpmovsxdq xmm2, QWORD PTR [8+esi+ecx*4]
vpmovsxdq xmm4, QWORD PTR [16+esi+ecx*4]
vpmovsxdq xmm6, QWORD PTR [24+esi+ecx*4]
vpaddq xmm0, xmm0, xmm1
vpaddq xmm3, xmm0, xmm2
add ecx, 8
vpaddq xmm5, xmm3, xmm4
vpaddq xmm0, xmm5, xmm6
cmp ecx, eax
jb .B1.4 ; Prob 82%

.B1.5: ; Preds .B1.4
vpshufd xmm1, xmm0, 14
vpaddq xmm0, xmm0, xmm1
vmovd ecx, xmm0
vpsrlq xmm2, xmm0, 32
vmovd esi, xmm2

Hi Tim,

Thanks a lot for your quick reply.

Of course, using Intel C++ Compiler is a good choice. But I want to do the vectorization by myself.

In my opinion, it costs a lot of time to read the integers out of memory one by one.

So I hope to load 4 32-bit integers by one instruction. Then do the pack and sum using Intrinsics. But I don't know how to cast 32-bit integer to 64-bit integer in the 128-bit registers.

My code is like:

__int32 arr32[4] = {INT_MIN, INT_MIN, INT_MIN, INT_MIN};

__m128i* wrd_ptr = (__m128i*)arr32; // cast the pointer type

__m128i xmm0 = _mm_setzero_si128(); // all-zero register to interleave with

xmm1 = _mm_loadu_si128(wrd_ptr); // a0 a1 a2 a3

xmm2 = _mm_unpacklo_epi32(xmm1, xmm0); // a0 0 a1 0

xmm3 = _mm_unpackhi_epi32(xmm1, xmm0); // a2 0 a3 0

Problem: how can I cast xmm2 (4 32-bit ints) to xmm2 (2 64-bit ints)?

If it is possible, then I can use the 64-bit addition like:

xmm4 = _mm_add_epi64(xmm2, xmm3);

Thank you!

Black Belt
Well, it's not very popular to do by intrinsics what most compilers will do well enough by source code, particularly if you aren't more interested in studying intrinsics than I am. If there were an intrinsic to convert a pair of 32-bit ints to 64-bit, you could use that, but I wouldn't count on it saving the time you expect.
You could use the intrinsic to convert a pair of 32-bit ints to double, and sum the doubles by _mm_add_pd, if you like that alternate approach to protecting against overflow. Any good compiler ought to choose those intrinsics automatically:
pxor xmm1, xmm1
movaps xmm0, xmm1

.B1.4: ; Preds .B1.4 .B1
cvtdq2pd xmm2, QWORD PTR [esi+eax*4]
cvtdq2pd xmm3, QWORD PTR [8+esi+eax*4]
cvtdq2pd xmm4, QWORD PTR [16+esi+eax*4]
cvtdq2pd xmm5, QWORD PTR [24+esi+eax*4]
addpd xmm0, xmm2
addpd xmm1, xmm3
addpd xmm0, xmm4
addpd xmm1, xmm5
add eax, 8
cmp eax, ecx
jb .B1.4 ; Prob 82%

.B1.5: ; Preds .B1.4
mov esi, DWORD PTR [8+esp]
addpd xmm0, xmm1
movaps xmm1, xmm0
unpckhpd xmm1, xmm0
addsd xmm0, xmm1

Note that the compiler chose to sum 8 elements per pass, ending up with 4 partial sums, which are combined into a single sum after the loop.
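The int-to-double alternative can be sketched with the `_mm_cvtepi32_pd` intrinsic (which maps to the `cvtdq2pd` instruction above). This is a simplified single-accumulator version of what the compiler generated, with a hypothetical function name; note the result is only exact while the running sums stay below 2^53 in magnitude.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>
#include <stddef.h>

/* Sum n 32-bit ints by converting pairs to double (SSE2).
   Assumes n is a multiple of 4. */
double sum_i32_as_double(const int32_t *a, size_t n)
{
    __m128d acc = _mm_setzero_pd();               /* two double partial sums */
    for (size_t i = 0; i < n; i += 4) {
        __m128i v = _mm_loadu_si128((const __m128i *)(a + i));
        /* cvtdq2pd converts the two low 32-bit ints to doubles */
        acc = _mm_add_pd(acc, _mm_cvtepi32_pd(v));
        acc = _mm_add_pd(acc, _mm_cvtepi32_pd(_mm_srli_si128(v, 8)));
    }
    double lanes[2];
    _mm_storeu_pd(lanes, acc);
    return lanes[0] + lanes[1];
}
```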

If you're talking about speeding up a disk-bound operation to read the big array, intrinsics won't solve the bottleneck.
Tim, thanks a lot for your suggestions.

I think I should try a good compiler to do the vectorization automatically.

Or, I may try the alternate approach of converting 32-bit int to double. But I am afraid this conversion might waste some time, so I am not sure whether the overall performance will increase.

Black Belt
You are taking a hit on performance by requiring the sum in extended precision. If you are using 32-bit mode, I would expect the double to be faster than 64-bit int, although I haven't tried such a comparison. If you don't promote to wider data type, of course you can use SSE intrinsics or an auto-vectorizing compiler, but you said this wouldn't do the job.