Intel® C++ Compiler
Support and discussions for creating C++ code that runs on platforms based on Intel® processors.

Is there any way to vectorize an array roll-up summation / CDF calculation?

mikeitexpert
New Contributor II
1,007 Views

Hi,

I am wondering if there is any way to vectorize the below loop?

#define NUMEL 10000000

int a[NUMEL];

for (int i = 1; i < NUMEL; i++){

a[i] += a[i-1];

}

I guess it needs a SIMD operation like the one below to be able to do that.

mikeitexpert_0-1622539365017.png

Please let me know if that is possible.

Regards

1 Solution
jimdempseyatthecove
Black Belt
988 Views

One way is to use intrinsic instructions:

1) load an 8-wide vector into a register

2) shift the register one lane right (shifting in 0) into a second register

3) add to the first register

4) repeat steps 2-3 until no remaining original cells (omit the last add)

5) store the result

6) save the running total (the last lane, broadcast) in a third register

7) fetch the next vector of data

8) repeat step 4, inserting a step 4.5 that adds the third register, until no remaining original cells (omit the last adds)

9) store

10) go to 7 until no more data

Note that the above approach makes only two memory references (one load, one store) per vector of data.
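The steps above can be sketched with SSE2 intrinsics. This is a minimal sketch, not Jim's exact code: it uses 4-wide __m128 float vectors rather than 8-wide, assumes the length is a multiple of 4, and `scan4` / `prefix_sum` are hypothetical names:

```cpp
#include <emmintrin.h>  // SSE2

// In-register inclusive prefix sum of one __m128 (steps 1-5):
// shift up one lane and add, then shift up two lanes and add.
static __m128 scan4(__m128 v) {
    v = _mm_add_ps(v, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(v), 4)));
    v = _mm_add_ps(v, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(v), 8)));
    return v;
}

// Running prefix sum over the whole array (steps 6-10); n must be a multiple of 4.
static void prefix_sum(float* a, int n) {
    __m128 carry = _mm_setzero_ps();            // third register: running total
    for (int i = 0; i < n; i += 4) {
        __m128 v = scan4(_mm_loadu_ps(&a[i]));  // one load per vector
        v = _mm_add_ps(v, carry);               // step 4.5: add the carry-in
        _mm_storeu_ps(&a[i], v);                // one store per vector
        // broadcast the last lane as the carry into the next vector
        carry = _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 3, 3, 3));
    }
}
```

As Jim notes, each vector of data is touched by exactly one load and one store; all of the shifting and adding happens in registers.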

 

If you are a student, please attribute this method (your professor likely reads this too).

 

Jim Dempsey


14 Replies

mikeitexpert
New Contributor II
934 Views

Hi Jim,

 

Thanks for the hint. I am trying to implement the pseudocode you mentioned; however, I can't find a single/double-precision shift intrinsic to achieve this, so I get casting errors to and from __m128i.

Please let me know your comments.

Regards

 

Below is the code:

#include <iostream>
#include <emmintrin.h>
#include <immintrin.h>
int main(){

	float *a = (float*) _mm_malloc((1 << 20)*sizeof(float), 512/8);
	float *as = (float*) _mm_malloc((1 << 20)*sizeof(float), 512/8);

	for (int i = 0; i < (1 << 20); i++){
		a[i] = (float)i;
		as[i] = (float)i;
	}

	float seri_sum, vec_sum, vec_parallel_sum;

	for (int i = 1; i < (1 << 20); i++){
		as[i] += as[i-1];
	}
	seri_sum = as[(1 << 20)-1];
	std::cout << "seri_sum" << " = " << seri_sum << std::endl;

	float sum = 0;
	for(int i = 0; i < (1 << 20); i+=((128)/(sizeof(float)*8))){
		
		__m128 vsum; 
		vsum = _mm_load_ps1(&sum);
		
		__m128 vec;
		vec = _mm_load_ps(&a[i]);
		
		
		
		vsum = _mm_add_ps(vsum,vec);
		for(int j = 1; j < 4; j++){
			vec = (__m128) _mm_slli_epi32((__m128i) vec,1);
			vsum = _mm_add_ps(vsum,vec);
		}
		_mm_store_ps(&a[i],vsum);
		sum = a[i+((128)/(sizeof(float)*8))-1];
	}

	_mm_free(a);
	_mm_free(as);
	return 0;
}

 

And below is the compile error


C:\tmp\simdtests>make clean test_vec
rm -f *.exp *.lib *.supp *.exe *.obj *.optrpt
icl.exe -Qopt-report:5 -Qopt-report-phase:all /Ob0  -Qopenmp -Qsimd -Qopenmp-simd  -arch:avx  -Qdiag-error-limit:5   -c test_vec.cpp -Fo:test_vec.obj
Intel(R) C++ Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 19.1.2.254 Build 20200623
Copyright (C) 1985-2020 Intel Corporation.  All rights reserved.

icl: remark #10397: optimization reports are generated in *.optrpt files in the output location
test_vec.cpp
test_vec.cpp(32): error: no suitable user-defined conversion from "__m128" to "__m128i" exists
                        vec = (__m128) _mm_slli_epi32((__m128i) vec,1);
                                                                ^

test_vec.cpp(32): error: no suitable user-defined conversion from "__m128i" to "__m128" exists
                        vec = (__m128) _mm_slli_epi32((__m128i) vec,1);
                                       ^

compilation aborted for test_vec.cpp (code 2)
make: *** [Makefile:32: test_vec.obj] Error 2

C:\tmp\simdtests>

 

jimdempseyatthecove
Black Belt
915 Views

Consult the Intel Intrinsics Guide, check the "Cast" checkbox:

https://software.intel.com/sites/landingpage/IntrinsicsGuide/#=undefined&expand=243&text=cast&cats=C...
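For instance, the reinterpret casts listed under "Cast" compile to no instructions and resolve the conversion errors above. A sketch (`shift_lane_up` is a hypothetical helper; note it uses the byte-wise _mm_slli_si128 rather than the per-element bit shift _mm_slli_epi32, since the goal is to move whole lanes):

```cpp
#include <emmintrin.h>

// Reinterpret a float vector as integers, byte-shift it, and cast back.
// The casts only change the type; they generate no machine instructions.
static __m128 shift_lane_up(__m128 v) {
    __m128i bits = _mm_castps_si128(v);   // __m128  -> __m128i (no-op cast)
    bits = _mm_slli_si128(bits, 4);       // byte shift up by one float lane
    return _mm_castsi128_ps(bits);        // __m128i -> __m128 (no-op cast)
}
```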

 

Also, you are compiling with -arch:avx. Why are you using 128-bit vectors?

If your CPU has avx2, then consider __m256 _mm256_permutevar8x32_ps (__m256 a, __m256i idx)

Newer CPUs have AVX512 (and similar intrinsics under swizzle)

Jim Dempsey

 

 

mikeitexpert
New Contributor II
900 Views

Hi Jim,

Thanks so much for valuable comments.

I have a macro-ed version of the code covering both SSE (up to 4.2) and AVX2 (possibly AVX512 later).

 

 

#include <iostream>
#include <emmintrin.h>
#include <immintrin.h>

#define _XCAT(a,b) a ## b
#define XCAT(a,b) _XCAT(a,b)
#define XCAT3(a,b,c) XCAT(XCAT(a,b),c)
#define XCAT4(a,b,c,d) XCAT(XCAT3(a,b,c),d)
#define XCAT5(a,b,c,d,e) XCAT(XCAT4(a,b,c,d),e)
#define XCAT6(a,b,c,d,e,f) XCAT(XCAT5(a,b,c,d,e),f)
#define XCAT7(a,b,c,d,e,f,g) XCAT(XCAT6(a,b,c,d,e,f),g)

#define NROWS 10000
#define NCOLS 10000

#define printvar(x) std::cout << #x << " = " << x << std::endl;

// #define dtype double
// #define PP d
// #define NB 64
#define dtype float
#define PP s
#define NB 32


// test __AVX__ first: AVX-enabled builds also define __SSE2__
#if defined(__AVX__)
#define VBLEN (256)
#define AVX 256
#define VNB 256
#elif defined(__SSE2__)
#define VBLEN (128)
#define AVX
#define VNB 128
#endif

#if NB == 32
#define VTYPE XCAT(__m,VNB)
#else
#define VTYPE XCAT3(__m,VNB,PP)
#endif
#define VTYPEi XCAT3(__m,VNB,i)

#define VLEN (VBLEN/(sizeof(dtype)*8))
#define LOAD1(a) XCAT4(_mm,AVX,_set1_p,PP)(*(a))
#define LOAD(a) XCAT4(_mm,AVX,_load_p,PP)(a)
#define LOADU(a) XCAT5(_mm,AVX,_loadu_p,PP)(a)
#define ADD(a,b) XCAT4(_mm,AVX,_add_p,PP)(a,b)
#define STORE(a,b) XCAT4(_mm,AVX,_store_p,PP)(a,b)
#define STOREU(a,b) XCAT4(_mm,AVX,_storeu_p,PP)(a,b)
// note: _mm256_slli_si256 only shifts within each 128-bit lane
#define SHIFTLEFT(a,b) XCAT4(_mm,AVX,_slli_si,VNB)(a,b)
#define CASTI2F(a) XCAT6(_mm,AVX,_castsi,VNB,_p,PP)(a)
#define CASTF2I(a) XCAT7(_mm,AVX,_castp,PP,_,si,VNB)(a)

#define NUMEL (1 << 10)
int main(){

	dtype *a = (dtype*) _mm_malloc(NUMEL*sizeof(dtype), 512/8);
	dtype *as = (dtype*) _mm_malloc(NUMEL*sizeof(dtype), 512/8);

	for(int i = 0; i < NUMEL; i++){
		a[i] = i;
		as[i] = i;
	}
	dtype seri_sum, vec_sum, vec_parallel_sum;

	for (int i = 1; i < NUMEL; i++){
		as[i] += as[i-1];
	}
	seri_sum = as[NUMEL-1];
	printvar(seri_sum)

	dtype sum = 0;
	for(int i = 0; i < NUMEL; i+=VLEN){
		// vsum = create vector set to sum
		VTYPE vsum; 
		vsum = LOAD1(&sum);
		// vec = load vector from arr[i,i+VLEN]
		VTYPE vec;
		vec = LOAD(&a[i]);
		// for as many times as length of the vector
		//     shift(l) vect
		//	   vsum += vec
		vsum = ADD(vsum,vec);
		for(int j = 1; j < VLEN; j++){
			vec = CASTI2F(SHIFTLEFT(CASTF2I(vec),sizeof(dtype)));
			vsum = ADD(vsum,vec);
		}
		STORE(&a[i], vsum);
		sum = a[i+VLEN-1];
	}
	vec_sum = sum;
	printvar(vec_sum)

	_mm_free(a);
	_mm_free(as);
	return 0;
}

 

 

Though, I am not too sure why you suggest _mm256_permutevar8x32_ps as opposed to _mm256_bslli_epi128 and _mm256_slli_si256, because the former has a latency of 3 while the latter two have a latency of 1. I guess it is because the latter two only shift within 128-bit lanes while I need a full 256-bit shift. _mm256_permutevar8x32_ps also has a different naming format, which makes it unsuitable for macro-naming. There are _mm_permutevar_pd and _mm_permutevar_ps, which are more macro-naming friendly, but they are not available on SSEx. (I am a bit confused.)

 

I guess I am just looking for a pair of shift operators with closely matching name formats, so as to make the macro-naming easier and more understandable. (Maybe I am overcomplicating things, or just trying to find the neatest solution.) I would appreciate your comments on that.

Much appreciate it.

 

mikeitexpert_0-1623473852920.png

mikeitexpert_1-1623474480123.png

 

 

jimdempseyatthecove
Black Belt
885 Views

>>the latter two only shift within 128-bit lanes while I need a full 256-bit shift

Your posted code sample was using __m128 vsum.

>>so as to make macro-naming perhaps easier

Why not use templates instead of macros?

 

>> I am not too sure why you suggest _mm256_permutevar8x32_ps as opposed to _mm256_bslli_epi128 and _mm256_slli_si256, because the former has a latency of 3 while the latter two have a latency of 1.

The bslli intrinsics divide the __m256 into two __m128 fields, then shift each by the indicated amount. To get the correct result, after computing each half of the __m256 vector you would have to propagate the carry from the low half into the high half. If (when) your total array size is one vector length, you could see a performance improvement. However, note that the throughputs are both 1 CPI, so for two or more (__m256) vectors' worth of data you would (should) see savings using the permutevar8x32.
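A full 256-bit one-lane shift can be sketched on top of _mm256_permutevar8x32_ps, under the assumption of AVX2 support (`shift_up1` is a hypothetical helper; the target attribute is for GCC/Clang, with icl enable AVX2 through the -arch switch instead):

```cpp
#include <immintrin.h>

// Shift eight packed floats up by one 32-bit lane (zero shifted in at lane 0),
// crossing the 128-bit lane boundary that slli_si256/bslli cannot cross.
__attribute__((target("avx2")))
void shift_up1(const float in[8], float out[8]) {
    __m256 v = _mm256_loadu_ps(in);
    const __m256i idx = _mm256_setr_epi32(0, 0, 1, 2, 3, 4, 5, 6);
    __m256 r = _mm256_permutevar8x32_ps(v, idx);        // lane i <- lane idx[i]
    r = _mm256_blend_ps(r, _mm256_setzero_ps(), 0x1);   // zero out lane 0
    _mm256_storeu_ps(out, r);
}
```

The permute fills lane i from lane idx[i], so lane 4 correctly receives lane 3 from the other 128-bit half; the blend then clears the duplicated lane 0.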

Jim Dempsey


mikeitexpert
New Contributor II
869 Views

@jimdempseyatthecove wrote:

>>so as to make macro-naming perhaps easier

Why not use templates instead of macros?

 

I am not sure ... I have heard anecdotes that pure C is faster than C++ (excuse the generalization), though I am not too sure. It would make sense, however, because I know runtime dispatchers/thunks are added to C++ code when it comes to polymorphism, virtual methods, etc.

 

>> I am not too sure why you suggest _mm256_permutevar8x32_ps as opposed to _mm256_bslli_epi128 and _mm256_slli_si256, because the former has a latency of 3 while the latter two have a latency of 1.

The bslli intrinsics divide the __m256 into two __m128 fields, then shift each by the indicated amount. To get the correct result, after computing each half of the __m256 vector you would have to propagate the carry from the low half into the high half. If (when) your total array size is one vector length, you could see a performance improvement. However, note that the throughputs are both 1 CPI, so for two or more (__m256) vectors' worth of data you would (should) see savings using the permutevar8x32.


Thanks for your valuable comments, much appreciated.

jimdempseyatthecove
Black Belt
864 Views

C vs C++

As long as your performance code is written without Structured Exception Handling, there should be no performance difference.

A completely different issue is programmers' predilection for OOP with AoS (arrays of structures) as opposed to procedure-oriented programming with SoA (structures of arrays). AoS is typically not vectorizable, whereas SoA most often is.
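The AoS vs SoA distinction can be sketched as follows (hypothetical `Point` types and `scale_x` helper):

```cpp
#include <vector>

// AoS: x, y, z interleaved in memory -- a loop over x alone strides
// through memory and generally defeats vectorization.
struct PointAoS { float x, y, z; };

// SoA: each field stored contiguously -- a loop over x[] reads
// consecutive memory and vectorizes cleanly.
struct PointsSoA {
    std::vector<float> x, y, z;
};

// Scale every x coordinate; with SoA this is a unit-stride, vectorizable loop.
void scale_x(PointsSoA& p, float s) {
    for (std::size_t i = 0; i < p.x.size(); ++i)
        p.x[i] *= s;
}
```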

 

Think of templates as an extended form of macros.
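As a sketch of the template alternative, a traits class can select the vector type and intrinsics from the element type, replacing the XCAT macro machinery (hypothetical names, SSE2 only):

```cpp
#include <emmintrin.h>

// Traits template: one specialization per element type picks the
// vector type, lane count, and intrinsics at compile time.
template <class T> struct simd;

template <> struct simd<float> {
    using vec = __m128;
    static constexpr int lanes = 4;
    static vec load(const float* p)    { return _mm_loadu_ps(p); }
    static vec add(vec a, vec b)       { return _mm_add_ps(a, b); }
    static void store(float* p, vec v) { _mm_storeu_ps(p, v); }
};

template <> struct simd<double> {
    using vec = __m128d;
    static constexpr int lanes = 2;
    static vec load(const double* p)    { return _mm_loadu_pd(p); }
    static vec add(vec a, vec b)        { return _mm_add_pd(a, b); }
    static void store(double* p, vec v) { _mm_storeu_pd(p, v); }
};

// Generic element-wise add works for either element type; n must be
// a multiple of simd<T>::lanes.
template <class T>
void vadd(const T* a, const T* b, T* out, int n) {
    using S = simd<T>;
    for (int i = 0; i < n; i += S::lanes)
        S::store(out + i, S::add(S::load(a + i), S::load(b + i)));
}
```

Unlike the macros, the wrong combination of element type and intrinsic fails to compile with a readable error, and the float/double switch no longer needs editing #defines.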

 

Jim Dempsey

mikeitexpert
New Contributor II
830 Views

 

Thanks so much for the crystal-clear, insightful details.

I am just curious to know how I can ensure SEH (Structured Exception Handling) is disabled. I came across the below in the compiler guide.

Thanks for valuable comments.

mikeitexpert_0-1623672975208.png

 

jimdempseyatthecove
Black Belt
829 Views

You would use that option.

Note, you can enable/disable on a file by file basis to get the behavior you want.

Jim Dempsey

jimdempseyatthecove
Black Belt
775 Views

That is the correct option.

Jim Dempsey

AbhishekD_Intel
Moderator
782 Views

Hi Mike,


Glad to know that the solution provided by Jim helped you.

Please give us a confirmation if your issue related to this thread is resolved.



Warm Regards,

Abhishek


mikeitexpert
New Contributor II
776 Views

Thanks so much. Feel free to close the thread.

Regards

AbhishekD_Intel
Moderator
711 Views

Hi Mike,


Thanks for the confirmation, and thanks a lot to Jim for the expert opinion.

As your issue related to this thread has been resolved, we will no longer monitor it. If you require any additional assistance from Intel, please start a new thread.

Any further interaction in this thread will be considered community only.



Warm Regards,

Abhishek

