- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

I want to sum up a large array of 32-bit integers, e.g 10M integers. How to do this in a very fast way?

I am now trying to use Intel SSE Intrinsics to do this.

Each time, I load 4 integers by one instruction, and then sum up. So I only need to loop for N/4 times,where N is the array size.

Steps are asfollows:

Loop N/4 times

1, load 4 32-bit int, a0 a1 a2 a3

2, pack the data as:

xmm1 = a0 0 a1 0

xmm2 = a2 0 a3 0

3, sum up

xmm3 = a0+a2 0 a1+a3 0

Loop End

The problem is: two 32-bit integers sum up may result in overflow.

So how to add two 32-bit integers and save them to a 64-bit integer by using SSE intrinsics functions?

Thank you very much.

Link Copied

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

You will have to wait for AVX to see 4-wide parallel 64-bit adds. Then, if you trust Intel C, you get them from plain C source code:

long long int extsum(int *a, int n){

long long int sum = 0;

for(int i = 0; i < n; ++i)sum += a

*;*

...

vpxor xmm0, xmm0, xmm0

add eax, edx

xor ecx, ecx

.B1.4: ; Preds .B1.4 .B

vpmovsxdq xmm1, QWORD PTR [esi+ecx*4]

vpmovsxdq xmm2, QWORD PTR [8+esi+ecx*4]

vpmovsxdq xmm4, QWORD PTR [16+esi+ecx*4]

vpmovsxdq xmm6, QWORD PTR [24+esi+ecx*4]

vpaddq xmm0, xmm0, xmm1

vpaddq xmm3, xmm0, xmm2

add ecx, 8

vpaddq xmm5, xmm3, xmm4

vpaddq xmm0, xmm5, xmm6

cmp ecx, eax

jb .B1.4 ; Prob 82%

.B1.5: ; Preds .B1.4

vpshufd xmm1, xmm0, 14

vpaddq xmm0, xmm0, xmm1

vmovd ecx, xmm0

vpsrlq xmm2, xmm0, 32

vmovd esi, xmm2

...

vpxor xmm0, xmm0, xmm0

add eax, edx

xor ecx, ecx

.B1.4: ; Preds .B1.4 .B

vpmovsxdq xmm1, QWORD PTR [esi+ecx*4]

vpmovsxdq xmm2, QWORD PTR [8+esi+ecx*4]

vpmovsxdq xmm4, QWORD PTR [16+esi+ecx*4]

vpmovsxdq xmm6, QWORD PTR [24+esi+ecx*4]

vpaddq xmm0, xmm0, xmm1

vpaddq xmm3, xmm0, xmm2

add ecx, 8

vpaddq xmm5, xmm3, xmm4

vpaddq xmm0, xmm5, xmm6

cmp ecx, eax

jb .B1.4 ; Prob 82%

.B1.5: ; Preds .B1.4

vpshufd xmm1, xmm0, 14

vpaddq xmm0, xmm0, xmm1

vmovd ecx, xmm0

vpsrlq xmm2, xmm0, 32

vmovd esi, xmm2

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Thanks a lot for your quick reply.

Of course, using Intel C++ Compiler is a good choice. But I want to do the vectorization by myself.

In my opinion, it costs a lot of time to access the memory to read out the integers one by one.

So I hope to load 4 32-bit integers by one instruction. Then do the pack and sum using Intrinsics. But I don't know how to cast 32-bit integer to 64-bit integer in the 128-bit registers.

My code is like:

__int32 arr32[4] = {INT_MIN, INT_MIN, INT_MIN, INT_MIN};

__m128i* wrd_ptr = (__m128i*)arr32; // cast the pointer type

xmm1 = _mm_loadu_si128(wrd_ptr); // a0 a1 a2 a3

xmm2 = _mm_unpacklo_epi32(xmm1, xmm0); // a0 0 a1 0

xmm3 = _mm_unpackhi_epi32(xmm1, xmm0); // a2 0 a3 0

Problem: how to cast xmm2(4 32-bit int) to xmm2(2 64-bit int)??

If it is possible, then I can use the 64-bit addition like:

xmm4 = _mm_add_epi64(xmm2, xmm3);

Thank you!

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

You could use the intrinsic to convert a pair of 32-bit ints to double, and sum the doubles by _mm_add_pd, if you like that alternate approach to protecting against overflow. Any good compiler ought to choose those intrinsics automatically:

pxor xmm1, xmm1

movaps xmm0, xmm1

.B1.4: ; Preds .B1.4 .B1

cvtdq2pd xmm2, QWORD PTR [esi+eax*4]

cvtdq2pd xmm3, QWORD PTR [8+esi+eax*4]

cvtdq2pd xmm4, QWORD PTR [16+esi+eax*4]

cvtdq2pd xmm5, QWORD PTR [24+esi+eax*4]

addpd xmm0, xmm2

addpd xmm1, xmm3

addpd xmm0, xmm4

addpd xmm1, xmm5

add eax, 8

cmp eax, ecx

jb .B1.4 ; Prob 82%

.B1.5: ; Preds .B1.4

mov esi, DWORD PTR [8+esp]

addpd xmm0, xmm1

movaps xmm1, xmm0

unpckhpd xmm1, xmm0

addsd xmm0, xmm1

Note that the compiler chose to sum 8 elements per pass, ending up with 4 partial sums, which are combined into a single sum after the loop.

If you're talking about speeding up a disk performance bound operation to read the big array, intrinsics won't solve the bottleneck.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

I think I should try a good compiler to do the vectorization automatically.

Or, I may try the alternate approach to convert 32-bit int to double. But Iam afraidthis conversion might waste some time. So I am not sure whether the whole performance will increase.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page