Hi,
I am wondering if there is any way to vectorize the loop below.

#define NUMEL 10000000
int a[NUMEL];
for (int i = 1; i < NUMEL; i++) {
    a[i] += a[i-1];
}

I guess it needs some kind of SIMD operation to be able to do that. Please let me know if that is possible.
Regards
One way is to use intrinsic instructions:
1) load an 8-wide vector into a register
2) shift the register one lane (shifting in 0) into a second register
3) add to the first register
4) repeat 2-3 until no original cells remain (omit the last add)
5) store the result
6) save the last lane of the result (the running sum) in a third register
7) fetch the next vector of data
8) add the running sum, then repeat steps 2-4 and 6
9) store
10) go to 7 until no more data
Note, the above operation has only two memory references per vector of data.
If you are a student, please attribute this method (your professor likely reads this too).
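A minimal sketch of these steps, assuming 4-wide SSE single-precision vectors rather than the 8-wide case described above, and using shifts of one then two lanes (which for 4 lanes completes the in-register scan):

#include <emmintrin.h>  // SSE2

// In-place inclusive prefix sum over n floats (assumes n is a multiple
// of 4 and a is 16-byte aligned). Names are illustrative.
void prefix_sum_sse(float* a, int n)
{
    __m128 carry = _mm_setzero_ps();            // running sum of prior vectors
    for (int i = 0; i < n; i += 4) {
        __m128 v = _mm_load_ps(&a[i]);          // load a 4-wide vector
        // shift left one lane (shifting in 0) and add, then two lanes
        v = _mm_add_ps(v, _mm_castsi128_ps(
                _mm_slli_si128(_mm_castps_si128(v), 4)));
        v = _mm_add_ps(v, _mm_castsi128_ps(
                _mm_slli_si128(_mm_castps_si128(v), 8)));
        v = _mm_add_ps(v, carry);               // add the running sum
        _mm_store_ps(&a[i], v);                 // store the result
        // broadcast the last lane as the carry for the next vector
        carry = _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 3, 3, 3));
    }
}

Each vector costs one load and one store, which is the two memory references per vector noted above.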
Jim Dempsey
This article may be of interest:
https://software.intel.com/content/www/us/en/develop/articles/elusive-algorithms-parallel-scan.html
Jim Dempsey
Hi Jim,
Thanks for the hint. I am trying to implement the pseudocode you mentioned. However, I can't find a single/double-precision shift intrinsic to achieve this, so I get casting errors to and from __m128i.
Please let me know your comments.
Regards
Below is the code:
#include <iostream>
#include <emmintrin.h>
#include <immintrin.h>

int main(){
    float *a  = (float*) _mm_malloc((1 << 20)*sizeof(float), 512/8);
    float *as = (float*) _mm_malloc((1 << 20)*sizeof(float), 512/8);
    float seri_sum, vec_sum, vec_parallel_sum;
    // serial reference scan
    for (int i = 1; i < (1 << 20); i++){
        as[i] += as[i-1];
    }
    seri_sum = as[(1 << 20)-1];
    std::cout << "seri_sum" << " = " << seri_sum << std::endl;
    float sum = 0;
    for (int i = 0; i < (1 << 20); i += ((128)/(sizeof(float)*8))){
        __m128 vsum = _mm_load_ps1(&sum);   // broadcast running sum
        __m128 vec  = _mm_load_ps(&a[i]);
        vsum = _mm_add_ps(vsum, vec);
        // shift vec one lane at a time and accumulate
        for (int j = 1; j < (128)/(sizeof(float)*8); j++){
            vec = (__m128) _mm_slli_epi32((__m128i) vec,1);
            vsum = _mm_add_ps(vsum, vec);
        }
        _mm_store_ps(&a[i], vsum);
        sum = a[i+((128)/(sizeof(float)*8))-1];
    }
    _mm_free(a);
    _mm_free(as);
    return 0;
}
And below is the compile error:
C:\tmp\simdtests>make clean test_vec
rm -f *.exp *.lib *.supp *.exe *.obj *.optrpt
icl.exe -Qopt-report:5 -Qopt-report-phase:all /Ob0 -Qopenmp -Qsimd -Qopenmp-simd -arch:avx -Qdiag-error-limit:5 -c test_vec.cpp -Fo:test_vec.obj
Intel(R) C++ Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 19.1.2.254 Build 20200623
Copyright (C) 1985-2020 Intel Corporation. All rights reserved.
icl: remark #10397: optimization reports are generated in *.optrpt files in the output location
test_vec.cpp
test_vec.cpp(32): error: no suitable user-defined conversion from "__m128" to "__m128i" exists
vec = (__m128) _mm_slli_epi32((__m128i) vec,1);
^
test_vec.cpp(32): error: no suitable user-defined conversion from "__m128i" to "__m128" exists
vec = (__m128) _mm_slli_epi32((__m128i) vec,1);
^
compilation aborted for test_vec.cpp (code 2)
make: *** [Makefile:32: test_vec.obj] Error 2
C:\tmp\simdtests>
Consult the Intel Intrinsics Guide and check the "Cast" checkbox; the cast intrinsics (e.g. _mm_castps_si128 / _mm_castsi128_ps) reinterpret a register without generating any instructions.
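For example, the failing line can be written with those casts. Note this sketch also swaps _mm_slli_epi32, which shifts bits within each 32-bit element, for _mm_slli_si128, which shifts the whole register by bytes (4 bytes = one float lane):

// reinterpret __m128 <-> __m128i (the casts generate no instructions),
// then byte-shift the whole register left by one float lane
vec = _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(vec), 4));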
Also, you are compiling with -arch:avx, so why are you using 128-bit vectors?
If your CPU has AVX2, then consider __m256 _mm256_permutevar8x32_ps(__m256 a, __m256i idx).
Newer CPUs have AVX-512 (with similar intrinsics listed under "swizzle").
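For illustration, a sketch of an 8-lane in-register scan built on _mm256_permutevar8x32_ps (AVX2); the index vectors and the blends that zero the vacated lanes are choices of this sketch, not the only option:

#include <immintrin.h>

// Inclusive prefix sum within one __m256 (8 floats), using
// _mm256_permutevar8x32_ps for full 256-bit lane shifts.
static inline __m256 scan8(__m256 v)
{
    const __m256 zero = _mm256_setzero_ps();
    __m256 t;
    // shift left by 1 lane (the blend zeroes the vacated lane), then add
    t = _mm256_permutevar8x32_ps(v, _mm256_setr_epi32(0, 0, 1, 2, 3, 4, 5, 6));
    v = _mm256_add_ps(v, _mm256_blend_ps(t, zero, 0x01));
    // shift left by 2 lanes
    t = _mm256_permutevar8x32_ps(v, _mm256_setr_epi32(0, 0, 0, 1, 2, 3, 4, 5));
    v = _mm256_add_ps(v, _mm256_blend_ps(t, zero, 0x03));
    // shift left by 4 lanes
    t = _mm256_permutevar8x32_ps(v, _mm256_setr_epi32(0, 0, 0, 0, 0, 1, 2, 3));
    v = _mm256_add_ps(v, _mm256_blend_ps(t, zero, 0x0F));
    return v;
}

The carry into the next vector can then be a broadcast of lane 7 (for example, _mm256_permutevar8x32_ps with an all-7 index) added before the store, as in the pseudocode earlier in the thread.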
Jim Dempsey
Hi Jim,
Thanks so much for the valuable comments.
I have a macro-ed version of the code to cover both SSE (up to 4.2) and AVX2 (possibly AVX-512 later).
#include <iostream>
#include <emmintrin.h>
#include <immintrin.h>

#define _XCAT(a,b) a ## b
#define XCAT(a,b) _XCAT(a,b)
#define XCAT3(a,b,c) XCAT(XCAT(a,b),c)
#define XCAT4(a,b,c,d) XCAT(XCAT3(a,b,c),d)
#define XCAT5(a,b,c,d,e) XCAT(XCAT4(a,b,c,d),e)
#define XCAT6(a,b,c,d,e,f) XCAT(XCAT5(a,b,c,d,e),f)
#define XCAT7(a,b,c,d,e,f,g) XCAT(XCAT6(a,b,c,d,e,f),g)

#define NROWS 10000
#define NCOLS 10000
#define printvar(x) std::cout << #x << " = " << x << std::endl;

// #define dtype double
// #define PP d
// #define NB 64
#define dtype float
#define PP s
#define NB 32

#ifdef __SSE2__
#define VBLEN (128)
#define AVX
#define VNB 128
#elif defined(__AVX__)
#define VBLEN (256)
#define AVX 256
#define VNB 256
#endif

#if NB == 32
#define VTYPE XCAT(__m,VNB)
#else
#define VTYPE XCAT3(__m,VNB,PP)
#endif
#define VTYPEi XCAT3(__m,VNB,i)
#define VLEN (VBLEN/(sizeof(dtype)*8))

#define LOAD1(a) XCAT5(_mm,AVX,_load_p,PP,1)(a)
#define LOAD(a) XCAT4(_mm,AVX,_load_p,PP)(a)
#define LOADU(a) XCAT5(_mm,AVX,_loadu_p,PP)(a)
#define ADD(a,b) XCAT4(_mm,AVX,_add_p,PP)(a,b)
#define STORE(a,b) XCAT4(_mm,AVX,_store_p,PP)(a,b)
#define STOREU(a,b) XCAT4(_mm,AVX,_storeu_p,PP)(a,b)
#define SHIFTLEFT(a,b) XCAT4(_mm,AVX,_bslli_si,VNB)(a,b)
#define CASTI2F(a) XCAT6(_mm,AVX,_castsi,VNB,_p,PP)(a)
#define CASTF2I(a) XCAT7(_mm,AVX,_castp,PP,_,si,VNB)(a)

#define NUMEL (1 << 10)

int main(){
    dtype *a  = (dtype*) _mm_malloc(NUMEL*sizeof(dtype), 512/8);
    dtype *as = (dtype*) _mm_malloc(NUMEL*sizeof(dtype), 512/8);
    for (int i = 0; i < NUMEL; i++){
        a[i]  = i;
        as[i] = i;
    }
    dtype seri_sum, vec_sum, vec_parallel_sum;
    // serial reference scan
    for (int i = 1; i < NUMEL; i++){
        as[i] += as[i-1];
    }
    seri_sum = as[NUMEL-1];
    printvar(seri_sum)
    dtype sum = 0;
    for (int i = 0; i < NUMEL; i += VLEN){
        // vsum = vector with every lane set to the running sum
        VTYPE vsum = LOAD1(&sum);
        // vec = load vector from a[i..i+VLEN)
        VTYPE vec = LOAD(&a[i]);
        vsum = ADD(vsum, vec);
        // shift vec one lane at a time and accumulate
        for (int j = 1; j < VLEN; j++){
            vec  = CASTI2F(SHIFTLEFT(CASTF2I(vec), sizeof(dtype)));
            vsum = ADD(vsum, vec);
        }
        STORE(&a[i], vsum);
        sum = a[i+VLEN-1];
    }
    vec_sum = sum;
    printvar(vec_sum)
    _mm_free(a);
    _mm_free(as);
    return 0;
}
Though, I am not too sure why you suggest _mm256_permutevar8x32_ps as opposed to _mm256_bslli_epi128 and _mm256_slli_si256, since the former has a latency of 3 while the latter have a latency of 1. I guess it is because the latter two only shift within 128-bit lanes while I need a 256-bit shift. Moreover, _mm256_permutevar8x32_ps has a different naming format, which makes it unsuitable for my macro naming. There are _mm_permutevar_pd and _mm_permutevar_ps, which fit the macro naming better, BUT they are NOT available on SSEx. (I am a bit confused.)
I guess I am just looking for a pair of shift operations with closely matching names, so as to make the macro naming easier / more understandable. (Maybe I am overcomplicating things, or I am just trying to find the neatest solution.) I would appreciate your comments on that, please.
Much appreciate it.
>>the latter two only shift within 128-bit lanes while I need a 256-bit shift.
Your posted code sample was using __m128 vsum.
>>so as to make the macro naming easier
Why not use templates instead of macros?
>> I am not too sure why you suggest _mm256_permutevar8x32_ps as opposed to _mm256_bslli_epi128 and _mm256_slli_si256, since the former has a latency of 3 while the latter have a latency of 1.
The bslli divides the __m256 into two __m128 fields, then shifts each by the indicated amount. To get the correct result, after computing the scan of each half of the __m256 vector, you would have to propagate the carry from the low half into the high half. If (when) your total array size is one vector length, you could find a performance improvement. However, note that the throughputs are both 1 CPI, so for two or more (__m256) vectors worth of data you would (should) see savings using the permutevar8x32.
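To make the extra carry step concrete, a sketch of the in-lane variant (AVX2 for _mm256_slli_si256; the helper name is illustrative):

#include <immintrin.h>

// Inclusive 8-float scan using in-lane byte shifts. _mm256_slli_si256
// shifts each 128-bit half independently, so the low half's total must
// be propagated into the high half afterwards.
static inline __m256 scan8_inlane(__m256 v)
{
    __m256 t;
    t = _mm256_castsi256_ps(_mm256_slli_si256(_mm256_castps_si256(v), 4));
    v = _mm256_add_ps(v, t);
    t = _mm256_castsi256_ps(_mm256_slli_si256(_mm256_castps_si256(v), 8));
    v = _mm256_add_ps(v, t);
    // each 128-bit half now holds its own 4-lane scan; add the low
    // half's total (its lane 3) to every lane of the high half
    __m256 carry = _mm256_permute2f128_ps(v, v, 0x08);   // [ zero | low half ]
    carry = _mm256_shuffle_ps(carry, carry, _MM_SHUFFLE(3, 3, 3, 3));
    return _mm256_add_ps(v, carry);
}

This propagation at the end is the step the full-width permutevar8x32 approach avoids.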
Jim Dempsey
@jimdempseyatthecove wrote:
>>so as to make the macro naming easier
Why not use templates instead of macros?
I am not sure... I have heard anecdotes that pure C is faster than C++ (excuse my general statements).
>> I am not too sure why you suggest _mm256_permutevar8x32_ps as opposed to _mm256_bslli_epi128 and _mm256_slli_si256, since the former has a latency of 3 while the latter have a latency of 1.
The bslli divides the __m256 into two __m128 fields, then shifts each by the indicated amount. To get the correct result, after computing the scan of each half of the __m256 vector, you would have to propagate the carry from the low half into the high half. If (when) your total array size is one vector length, you could find a performance improvement. However, note that the throughputs are both 1 CPI, so for two or more (__m256) vectors worth of data you would (should) see savings using the permutevar8x32.
Thanks for your valuable comments, much appreciated.
C vs C++
As long as your performance code is written without Structured Exception Handling, there should be no performance difference.
A completely different issue with programmers is the predilection for OOP with AOS (arrays of structures) as opposed to procedure-oriented programming with SOA (structures of arrays). AOS is typically not vectorizable, whereas SOA most often is (see the sketch below).
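A minimal illustration of the difference (the names are illustrative):

// AOS: the fields of one element are adjacent, so the x values are
// strided in memory and a loop over them resists vectorization.
struct PointAOS { float x, y, z; };

// SOA: each field is a contiguous array, so a loop over x is a
// unit-stride stream the compiler can vectorize directly.
struct PointsSOA { float x[1024]; float y[1024]; float z[1024]; };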
Think of templates as an extended form of macros.
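As a sketch of that route, the macro layer in the code above could become a traits class specialized per element type (VecTraits and add_vec are illustrative names, not a recommended API):

#include <immintrin.h>

template <typename T> struct VecTraits;   // primary template, never defined

template <> struct VecTraits<float> {
    using reg = __m128;
    static reg  load (const float* p)   { return _mm_load_ps(p); }
    static reg  add  (reg a, reg b)     { return _mm_add_ps(a, b); }
    static void store(float* p, reg v)  { _mm_store_ps(p, v); }
};

template <> struct VecTraits<double> {
    using reg = __m128d;
    static reg  load (const double* p)  { return _mm_load_pd(p); }
    static reg  add  (reg a, reg b)     { return _mm_add_pd(a, b); }
    static void store(double* p, reg v) { _mm_store_pd(p, v); }
};

// Generic code then reads naturally and inlines to the same intrinsics:
template <typename T>
void add_vec(T* dst, const T* a, const T* b)
{
    using V = VecTraits<T>;
    V::store(dst, V::add(V::load(a), V::load(b)));
}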
Jim Dempsey
Thanks so much for the crystal-clear, insightful details.
I am just curious: how can I ensure SEH (Structured Exception Handling) is disabled? I came across an option for this in the compiler guide.
Thanks for the valuable comments.
You would use that option.
Note, you can enable/disable it on a file-by-file basis to get the behavior you want.
Jim Dempsey
Hi Mike,
Glad to know that the solution provided by Jim helped you.
Please give us a confirmation that your issue related to this is resolved.
Warm Regards,
Abhishek
Thanks so much; feel free to close the thread.
Regards
Hi Mike,
Thanks for the confirmation, and thanks a lot to Jim for the expert opinion.
As your issue related to this thread has been resolved, we will no longer monitor this thread. If you require any additional assistance from Intel, please start a new thread.
Any further interaction in this thread will be considered community only.
Warm Regards,
Abhishek
