- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I want to sum up a large array of 32-bit integers, e.g 10M integers. How to do this in a very fast way?
I am now trying to use Intel SSE Intrinsics to do this.
Each time, I load 4 integers by one instruction, and then sum up. So I only need to loop for N/4 times,where N is the array size.
Steps are asfollows:
Loop N/4 times
1, load 4 32-bit int, a0 a1 a2 a3
2, pack the data as:
xmm1 = a0 0 a1 0
xmm2 = a2 0 a3 0
3, sum up
xmm3 = a0+a2 0 a1+a3 0
Loop End
The problem is: two 32-bit integers sum up may result in overflow.
So how to add two 32-bit integers and save them to a 64-bit integer by using SSE intrinsics functions?
Thank you very much.
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
You will have to wait for AVX to see 4-wide parallel 64-bit adds. Then, if you trust Intel C, you get them from plain C source code:
long long int extsum(int *a, int n){
long long int sum = 0;
for(int i = 0; i < n; ++i)sum += a;
...
vpxor xmm0, xmm0, xmm0
add eax, edx
xor ecx, ecx
.B1.4: ; Preds .B1.4 .B
vpmovsxdq xmm1, QWORD PTR [esi+ecx*4]
vpmovsxdq xmm2, QWORD PTR [8+esi+ecx*4]
vpmovsxdq xmm4, QWORD PTR [16+esi+ecx*4]
vpmovsxdq xmm6, QWORD PTR [24+esi+ecx*4]
vpaddq xmm0, xmm0, xmm1
vpaddq xmm3, xmm0, xmm2
add ecx, 8
vpaddq xmm5, xmm3, xmm4
vpaddq xmm0, xmm5, xmm6
cmp ecx, eax
jb .B1.4 ; Prob 82%
.B1.5: ; Preds .B1.4
vpshufd xmm1, xmm0, 14
vpaddq xmm0, xmm0, xmm1
vmovd ecx, xmm0
vpsrlq xmm2, xmm0, 32
vmovd esi, xmm2
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thanks a lot for your quick reply.
Of course, using Intel C++ Compiler is a good choice. But I want to do the vectorization by myself.
In my opinion, it costs a lot of time to access the memory to read out the integers one by one.
So I hope to load 4 32-bit integers by one instruction. Then do the pack and sum using Intrinsics. But I don't know how to cast 32-bit integer to 64-bit integer in the 128-bit registers.
My code is like:
__int32 arr32[4] = {INT_MIN, INT_MIN, INT_MIN, INT_MIN};
__m128i* wrd_ptr = (__m128i*)arr32; // cast the pointer type
xmm1 = _mm_loadu_si128(wrd_ptr); // a0 a1 a2 a3
xmm2 = _mm_unpacklo_epi32(xmm1, xmm0); // a0 0 a1 0
xmm3 = _mm_unpackhi_epi32(xmm1, xmm0); // a2 0 a3 0
Problem: how to cast xmm2(4 32-bit int) to xmm2(2 64-bit int)??
If it is possible, then I can use the 64-bit addition like:
xmm4 = _mm_add_epi64(xmm2, xmm3);
Thank you!
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
You could use the intrinsic to convert a pair of 32-bit ints to double, and sum the doubles by _mm_add_pd, if you like that alternate approach to protecting against overflow. Any good compiler ought to choose those intrinsics automatically:
pxor xmm1, xmm1
movaps xmm0, xmm1
.B1.4: ; Preds .B1.4 .B1
cvtdq2pd xmm2, QWORD PTR [esi+eax*4]
cvtdq2pd xmm3, QWORD PTR [8+esi+eax*4]
cvtdq2pd xmm4, QWORD PTR [16+esi+eax*4]
cvtdq2pd xmm5, QWORD PTR [24+esi+eax*4]
addpd xmm0, xmm2
addpd xmm1, xmm3
addpd xmm0, xmm4
addpd xmm1, xmm5
add eax, 8
cmp eax, ecx
jb .B1.4 ; Prob 82%
.B1.5: ; Preds .B1.4
mov esi, DWORD PTR [8+esp]
addpd xmm0, xmm1
movaps xmm1, xmm0
unpckhpd xmm1, xmm0
addsd xmm0, xmm1
Note that the compiler chose to sum 8 elements per pass, ending up with 4 partial sums, which are combined into a single sum after the loop.
If you're talking about speeding up a disk performance bound operation to read the big array, intrinsics won't solve the bottleneck.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I think I should try a good compiler to do the vectorization automatically.
Or, I may try the alternate approach to convert 32-bit int to double. But Iam afraidthis conversion might waste some time. So I am not sure whether the whole performance will increase.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page