
Intel Community › Software Development Tools (Compilers, Debuggers, Profilers & Analyzers) › Intel® C++ Compiler

## How to sum up a large array of 32-bit integers in a fast way?

## wheelsdong (Beginner), 03-17-2010 08:30 PM

Hi,

I want to sum up a large array of 32-bit integers, e.g. 10M integers. What is the fastest way to do this?

I am currently trying to use Intel SSE intrinsics.

Each iteration, I load 4 integers with one instruction and then sum them up, so I only need to loop N/4 times, where N is the array size.

The steps are as follows:

Loop N/4 times:

1. Load four 32-bit ints: a0 a1 a2 a3
2. Pack the data as:
   xmm1 = a0 0 a1 0
   xmm2 = a2 0 a3 0
3. Sum up:
   xmm3 = a0+a2 0 a1+a3 0

Loop end.

The problem is: the sum of two 32-bit integers may overflow.

So how can I add two 32-bit integers and save the result in a 64-bit integer using SSE intrinsics?

Thank you very much.
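For reference, the unpack-and-add loop described in the steps above can be sketched with SSE2 intrinsics as follows. This is a minimal sketch with an illustrative function name; note that zero-extending via unpack, as the steps describe, is only correct when the inputs are non-negative, which is exactly the sign/overflow issue discussed in the replies.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Sum n non-negative 32-bit ints into a 64-bit total (n a multiple of 4).
   Illustrative sketch only; see the replies for a sign-correct version. */
int64_t sum_u32_sse2(const int32_t *a, int n)
{
    __m128i zero = _mm_setzero_si128();
    __m128i acc  = _mm_setzero_si128();  /* two 64-bit partial sums */
    for (int i = 0; i < n; i += 4) {
        __m128i v  = _mm_loadu_si128((const __m128i *)(a + i)); /* a0 a1 a2 a3 */
        __m128i lo = _mm_unpacklo_epi32(v, zero);  /* a0 0 | a1 0 -> two 64-bit lanes */
        __m128i hi = _mm_unpackhi_epi32(v, zero);  /* a2 0 | a3 0 */
        acc = _mm_add_epi64(acc, _mm_add_epi64(lo, hi));
    }
    int64_t parts[2];
    _mm_storeu_si128((__m128i *)parts, acc);  /* combine the two partial sums */
    return parts[0] + parts[1];
}
```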

5 Replies

## TimP (Black Belt), 03-17-2010 09:48 PM

Strictly speaking, of course, there are no original SSE instructions for 64-bit ints, and you can handle only pairs of 64-bit data under SSE2. In SSE2 you have _mm_add_epi64, so you could extend the ints to 64-bit and pack them into an __m128i, then add, and at the end add the two 64-bit ints.

You will have to wait for AVX to see 4-wide parallel 64-bit adds. Then, if you trust Intel C, you get them from plain C source code:

```c
long long int extsum(int *a, int n){
    long long int sum = 0;
    for(int i = 0; i < n; ++i) sum += a[i];
    return sum;
}
```

...

```asm
vpxor xmm0, xmm0, xmm0
add eax, edx
xor ecx, ecx
.B1.4: ; Preds .B1.4 .B
vpmovsxdq xmm1, QWORD PTR [esi+ecx*4]
vpmovsxdq xmm2, QWORD PTR [8+esi+ecx*4]
vpmovsxdq xmm4, QWORD PTR [16+esi+ecx*4]
vpmovsxdq xmm6, QWORD PTR [24+esi+ecx*4]
vpaddq xmm0, xmm0, xmm1
vpaddq xmm3, xmm0, xmm2
add ecx, 8
vpaddq xmm5, xmm3, xmm4
vpaddq xmm0, xmm5, xmm6
cmp ecx, eax
jb .B1.4 ; Prob 82%
.B1.5: ; Preds .B1.4
vpshufd xmm1, xmm0, 14
vpaddq xmm0, xmm0, xmm1
vmovd ecx, xmm0
vpsrlq xmm2, xmm0, 32
vmovd esi, xmm2
```
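The widening approach described here (extend to 64-bit, then _mm_add_epi64) can be done in plain SSE2, without waiting for a sign-extending conversion instruction, by interleaving each lane with its own sign bits. A minimal sketch, with an illustrative function name:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Overflow-safe sum of n signed 32-bit ints (n a multiple of 4). */
int64_t extsum_sse2(const int32_t *a, int n)
{
    __m128i acc = _mm_setzero_si128();  /* two 64-bit partial sums */
    for (int i = 0; i < n; i += 4) {
        __m128i v    = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i sign = _mm_srai_epi32(v, 31);        /* 0 or -1 per lane */
        __m128i lo   = _mm_unpacklo_epi32(v, sign);  /* sign-extend a0, a1 to 64-bit */
        __m128i hi   = _mm_unpackhi_epi32(v, sign);  /* sign-extend a2, a3 to 64-bit */
        acc = _mm_add_epi64(acc, _mm_add_epi64(lo, hi));
    }
    int64_t parts[2];
    _mm_storeu_si128((__m128i *)parts, acc);  /* combine the two partial sums */
    return parts[0] + parts[1];
}
```

Interleaving with _mm_srai_epi32(v, 31) reproduces two's-complement sign extension, so negative inputs are summed correctly.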


## wheelsdong (Beginner), 03-17-2010 10:26 PM

Hi Tim,

Thanks a lot for your quick reply.

Of course, using the Intel C++ Compiler is a good choice, but I want to do the vectorization myself.

In my opinion, it costs a lot of time to read the integers out of memory one by one.

So I hope to load four 32-bit integers with one instruction, then do the pack and sum using intrinsics. But I don't know how to convert a 32-bit integer to a 64-bit integer inside the 128-bit registers.

My code is like:


```c
__int32 arr32[4] = {INT_MIN, INT_MIN, INT_MIN, INT_MIN};

__m128i xmm0 = _mm_setzero_si128();            // zeros to unpack against
__m128i* wrd_ptr = (__m128i*)arr32;            // cast the pointer type
__m128i xmm1 = _mm_loadu_si128(wrd_ptr);       // a0 a1 a2 a3
__m128i xmm2 = _mm_unpacklo_epi32(xmm1, xmm0); // a0 0 a1 0
__m128i xmm3 = _mm_unpackhi_epi32(xmm1, xmm0); // a2 0 a3 0
```

Problem: how do I convert xmm2 (four 32-bit ints) to xmm2 (two 64-bit ints)?

If that is possible, then I can use the 64-bit addition like:

```c
__m128i xmm4 = _mm_add_epi64(xmm2, xmm3);
```

Thank you!

## TimP (Black Belt), 03-18-2010 07:53 AM

Well, it's not very popular to do with intrinsics what most compilers will do well enough from source code, particularly if you aren't more interested in studying intrinsics than I am. If there were an intrinsic to convert a pair of 32-bit ints to 64-bit, you could use it, but I wouldn't count on it saving the time you expect.

You could use the intrinsic to convert a pair of 32-bit ints to double, and sum the doubles by _mm_add_pd, if you like that alternate approach to protecting against overflow. Any good compiler ought to choose those intrinsics automatically:

```asm
pxor xmm1, xmm1
movaps xmm0, xmm1
.B1.4: ; Preds .B1.4 .B1
cvtdq2pd xmm2, QWORD PTR [esi+eax*4]
cvtdq2pd xmm3, QWORD PTR [8+esi+eax*4]
cvtdq2pd xmm4, QWORD PTR [16+esi+eax*4]
cvtdq2pd xmm5, QWORD PTR [24+esi+eax*4]
addpd xmm0, xmm2
addpd xmm1, xmm3
addpd xmm0, xmm4
addpd xmm1, xmm5
add eax, 8
cmp eax, ecx
jb .B1.4 ; Prob 82%
.B1.5: ; Preds .B1.4
mov esi, DWORD PTR [8+esp]
addpd xmm0, xmm1
movaps xmm1, xmm0
unpckhpd xmm1, xmm0
addsd xmm0, xmm1
```

Note that the compiler chose to sum 8 elements per pass, ending up with 4 partial sums, which are combined into a single sum after the loop.

If you're talking about speeding up a disk-bound operation to read the big array, intrinsics won't solve the bottleneck.
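The int-to-double approach described here can be sketched with SSE2 intrinsics as follows. A minimal sketch with an illustrative function name: _mm_cvtepi32_pd converts only the two low 32-bit lanes, so the high pair is shifted down first, and the result is exact as long as the running total stays within 2^53.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Sum n signed 32-bit ints as doubles (n a multiple of 4).
   Exact while the total magnitude stays below 2^53. */
double dsum_sse2(const int32_t *a, int n)
{
    __m128d acc = _mm_setzero_pd();  /* two double partial sums */
    for (int i = 0; i < n; i += 4) {
        __m128i v  = _mm_loadu_si128((const __m128i *)(a + i));
        __m128d lo = _mm_cvtepi32_pd(v);                     /* a0, a1 as doubles */
        __m128d hi = _mm_cvtepi32_pd(_mm_srli_si128(v, 8));  /* a2, a3 as doubles */
        acc = _mm_add_pd(acc, _mm_add_pd(lo, hi));
    }
    double parts[2];
    _mm_storeu_pd(parts, acc);  /* combine the two partial sums */
    return parts[0] + parts[1];
}
```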


## wheelsdong (Beginner), 03-18-2010 07:15 PM

Tim, thanks a lot for your suggestions.

I think I should try a good compiler to do the vectorization automatically.

Or, I may try the alternate approach of converting 32-bit int to double. But I am afraid this conversion might waste some time, so I am not sure whether the overall performance will improve.


## TimP (Black Belt), 03-19-2010 06:51 AM

You are taking a performance hit by requiring the sum in extended precision. If you are using 32-bit mode, I would expect double to be faster than 64-bit int, although I haven't tried such a comparison. If you don't promote to a wider data type, of course you can use SSE intrinsics or an auto-vectorizing compiler, but you said that wouldn't do the job.

