How to sum up a large array of 32-bit integers in a fast way?

wheelsdong · ‎03-17-2010

Hi,

I want to sum up a large array of 32-bit integers, e.g 10M integers. How to do this in a very fast way?

I am now trying to use Intel SSE Intrinsics to do this.

Each time, I load 4 integers by one instruction, and then sum up. So I only need to loop for N/4 times,where N is the array size.

Steps are asfollows:
Loop N/4 times
1, load 4 32-bit int, a0 a1 a2 a3
2, pack the data as:
xmm1 = a0 0 a1 0
xmm2 = a2 0 a3 0
3, sum up
xmm3 = a0+a2 0 a1+a3 0
Loop End

The problem is: two 32-bit integers sum up may result in overflow.

So how to add two 32-bit integers and save them to a 64-bit integer by using SSE intrinsics functions?

Thank you very much.

TimP · ‎03-17-2010

Strictly speaking, of course, there are no original SSE instructions for 64-bit ints, and you can handle only pairs of 64-bit data under SSE2. In you have _mm_add_epi64, so you could extend the ints to 64-bit and pack them into an m128, then add, and at the end add the 2 64-bit ints.
You will have to wait for AVX to see 4-wide parallel 64-bit adds. Then, if you trust Intel C, you get them from plain C source code:
long long int extsum(int *a, int n){
long long int sum = 0;
for(int i = 0; i < n; ++i)sum += a;
...

vpxor xmm0, xmm0, xmm0
add eax, edx
xor ecx, ecx

.B1.4: ; Preds .B1.4 .B
vpmovsxdq xmm1, QWORD PTR [esi+ecx*4]
vpmovsxdq xmm2, QWORD PTR [8+esi+ecx*4]
vpmovsxdq xmm4, QWORD PTR [16+esi+ecx*4]
vpmovsxdq xmm6, QWORD PTR [24+esi+ecx*4]
vpaddq xmm0, xmm0, xmm1
vpaddq xmm3, xmm0, xmm2
add ecx, 8
vpaddq xmm5, xmm3, xmm4
vpaddq xmm0, xmm5, xmm6
cmp ecx, eax
jb .B1.4 ; Prob 82%

.B1.5: ; Preds .B1.4
vpshufd xmm1, xmm0, 14
vpaddq xmm0, xmm0, xmm1
vmovd ecx, xmm0
vpsrlq xmm2, xmm0, 32
vmovd esi, xmm2

wheelsdong · ‎03-17-2010

Hi Tim,

Thanks a lot for your quick reply.

Of course, using Intel C++ Compiler is a good choice. But I want to do the vectorization by myself.

In my opinion, it costs a lot of time to access the memory to read out the integers one by one.

So I hope to load 4 32-bit integers by one instruction. Then do the pack and sum using Intrinsics. But I don't know how to cast 32-bit integer to 64-bit integer in the 128-bit registers.

My code is like:

__int32 arr32[4] = {INT_MIN, INT_MIN, INT_MIN, INT_MIN};

__m128i* wrd_ptr = (__m128i*)arr32; // cast the pointer type

xmm1 = _mm_loadu_si128(wrd_ptr); // a0 a1 a2 a3

xmm2 = _mm_unpacklo_epi32(xmm1, xmm0); // a0 0 a1 0

xmm3 = _mm_unpackhi_epi32(xmm1, xmm0); // a2 0 a3 0

Problem: how to cast xmm2(4 32-bit int) to xmm2(2 64-bit int)??

If it is possible, then I can use the 64-bit addition like:

xmm4 = _mm_add_epi64(xmm2, xmm3);

Thank you!

TimP · ‎03-18-2010

Well, it's not very popular to do by intrinsics what most compilers will do well enough by source code, particularly if you aren't more interested in studying intrinsics than I am. If there were an intrinsic to convert a pair of 32-bit ints to 64-bit, you could use that, but I wouldn't count on it saving the time you expect.
You could use the intrinsic to convert a pair of 32-bit ints to double, and sum the doubles by _mm_add_pd, if you like that alternate approach to protecting against overflow. Any good compiler ought to choose those intrinsics automatically:
pxor xmm1, xmm1
movaps xmm0, xmm1

.B1.4: ; Preds .B1.4 .B1
cvtdq2pd xmm2, QWORD PTR [esi+eax*4]
cvtdq2pd xmm3, QWORD PTR [8+esi+eax*4]
cvtdq2pd xmm4, QWORD PTR [16+esi+eax*4]
cvtdq2pd xmm5, QWORD PTR [24+esi+eax*4]
addpd xmm0, xmm2
addpd xmm1, xmm3
addpd xmm0, xmm4
addpd xmm1, xmm5
add eax, 8
cmp eax, ecx
jb .B1.4 ; Prob 82%

.B1.5: ; Preds .B1.4
mov esi, DWORD PTR [8+esp]
addpd xmm0, xmm1
movaps xmm1, xmm0
unpckhpd xmm1, xmm0
addsd xmm0, xmm1

Note that the compiler chose to sum 8 elements per pass, ending up with 4 partial sums, which are combined into a single sum after the loop.

If you're talking about speeding up a disk performance bound operation to read the big array, intrinsics won't solve the bottleneck.

wheelsdong · ‎03-18-2010

Tim, thanks a lot for your suggestions.

I think I should try a good compiler to do the vectorization automatically.

Or, I may try the alternate approach to convert 32-bit int to double. But Iam afraidthis conversion might waste some time. So I am not sure whether the whole performance will increase.

TimP · ‎03-19-2010

You are taking a hit on performance by requiring the sum in extended precision. If you are using 32-bit mode, I would expect the double to be faster than 64-bit int, although I haven't tried such a comparison. If you don't promote to wider data type, of course you can use SSE intrinsics or an auto-vectorizing compiler, but you said this wouldn't do the job.