Intel® oneAPI Math Kernel Library

Better way to sum the elements of a vector?

Fiori
Beginner

Hello!

I want to sum the elements of a vector y. In order to do that I do:

1) create a vector x(i)=1, for all i

2) use the function cblas_?dot 

cblas_ddot(n, x, 1, y, 1)
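For completeness, a minimal compilable version of this workaround might look like the following (the length n and the contents of y are placeholders):

#include <stdio.h>
#include <stdlib.h>
#include <mkl.h>

int main(void)
{
    const MKL_INT n = 1000;                   /* placeholder length */
    double *x = malloc(n * sizeof(double));   /* the auxiliary vector of ones */
    double *y = malloc(n * sizeof(double));   /* the data to be summed */

    for (MKL_INT i = 0; i < n; i++) {
        x[i] = 1.0;
        y[i] = (double)i;                     /* placeholder data */
    }

    double sum = cblas_ddot(n, x, 1, y, 1);
    printf("sum = %f\n", sum);

    free(x); free(y);
    return 0;
}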

I would like to ask if there is a better way to do this.

Thank you very much.

TimP
Honored Contributor III

The primary alternative to MKL BLAS for simd optimization of a sum reduction is your compiler's auto-vectorization.  Unfortunately, gcc and icc behave slightly differently here, which may deter you from considering either of them "best."  MSVC++ will not perform simd-optimized reduction, so if your unstated ground rules involve that compiler, you may consider the BLAS to be "best."

MKL ?dot will switch over automatically to combined simd and threaded optimization at some large operand length (> 4000 ?), and may select a simd variant automatically at run time. The additional overhead may be significant if your operand is of moderate length (< 400 ?). If you use compiler auto-vectorization and wish such a combination, you may need to write it out in nested loops.

Intel compilers implement both OpenMP simd reduction, via #pragma omp simd reduction(+: ...), and parallel threaded reduction, via #pragma omp parallel reduction(+: ...).   If the code is in the form required by omp simd reduction, the optimization should occur anyway at default compiler flags (preferably with appropriate options) even when the pragma is omitted.  gcc should perform the simd optimization without the pragma when -ffast-math -O3 and a suitable -march are set (and will not perform it without -ffast-math even under #pragma omp simd reduction), but -ffast-math can't be recommended without qualification.
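As an illustration of the simd form (a sketch with placeholder names, not code from this thread):

double sum_simd(const double *a, int n)
{
    double sum = 0.0;
    /* OpenMP simd reduction; with icc, enabled by -qopenmp or -qopenmp-simd */
    #pragma omp simd reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}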

gcc will not "riffle" the sum reduction (split it into several interleaved partial sums) for best performance of a single large sum, but compiler simd optimization is still likely to perform better than multiple ?dot function calls.
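Here is a sketch of a manually riffled sum with four independent partial accumulators (a common hand optimization, not taken from this thread):

double sum_riffled(const double *a, int n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i;
    /* four partial sums break the serial dependence chain of a single accumulator */
    for (i = 0; i + 3 < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)   /* remainder elements */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}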

Combinations of operations bring in further considerations on what may be the best choice.

Sorry, no one has guessed your parameters well enough to give a simple answer.

If you are summing by multiplying with a vector of all 1's, obviously dot is not the efficient way.
Fiori
Beginner

Thank you for your help, but to be honest I haven't understood the answer. I would appreciate it if you could give me a simpler one.

At some point in my code I want to compute the sum of a vector B, where B has from 5e+3 to 1e+5 elements. I have searched for an appropriate "sum" function but have found only cblas_ddot. So I assumed that a way to do this is to create a vector of "ones" and then compute a vector-vector dot product. For example:

niters = 1e+6;

for (iter = 1; iter <= niters; iter++)
{
    /* do some steps ... */

    sumLog = 0.0;
    for (i = 0; i < nDays; i++)
    {
        inda = cuma;

        cblas_daxpy(DailySize, -phi1, &a[inda], 1, &B[inda+1], 1);

        partial = (1.0/sigma2)*(1.0-phi1*phi1)*a[inda] -
                  (1.0/sigma2)*(phi1 - 1.0)*cblas_ddot(DailySize, &B[inda+1], 1, &ones[inda+1], 1);

        sumLog = sumLog + partial;
    }

    /* some other steps ... */
}
mecej4
Honored Contributor III

How about the functions cblas_dasum(), cblas_sasum(), etc.?
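For reference, ?asum returns the sum of magnitudes; with illustrative values:

double x[3] = {1.0, -2.0, 3.0};
double s = cblas_dasum(3, x, 1);   /* s == 6.0, i.e. |1| + |-2| + |3| */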

Fiori
Beginner

cblas_dasum() and cblas_sasum() don't work for me because they compute the sum of the absolute values of the elements, but my vector has both positive and negative values.

Thank you for your reply.

 

Ying_H_Intel
Employee

Hi Fiori, 

Do you have other Intel software installed on your development machine, like the Intel C/C++ Compiler or Intel Integrated Performance Primitives (Intel IPP)?

If you have the Intel compiler, you can build your original code with the Intel C/C++ Compiler, which should be able to speed up the sum code automatically (you don't need to rewrite the original code).

If you have Intel IPP, you can call an IPP function (analogous to an MKL function):

ippsSum_32f(const Ipp32f* pSrc, int len, Ipp32f* pSum, IppHintAlgorithm hint);

Best Regards,
Ying 

Example

The example below shows how to use ippsSum (here the scaled 16-bit integer variant, ippsSum_16s_Sfs).


#include <stdio.h>
#include "ipps.h"

void sum(void) {
    Ipp16s x[4] = {-32768, 32767, 32767, 32767}, sm;
    /* scale factor 1: the result is scaled by 2^-1, i.e., divided by 2 */
    ippsSum_16s_Sfs(x, 4, &sm, 1);
    printf("sum = %d\n", (int)sm);
}
Output:
sum = 32766
Matlab* Analog:
>> x = [-32768, 32767, 32767, 32767]; sum(x)/2

TimP
Honored Contributor III

Several proprietary implementations of BLAS include ?sum, but MKL doesn't include this extension, presumably because it is normally more efficient simply to write an omp reduction loop or equivalent.  Sorry I overlooked this aspect of your subject.  You could write your own and achieve more efficiency than substituting ?dot with multiplication by a vector of 1's, which certainly isn't popular as a "best" alternative.
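For instance, the ?dot call in the posted loop could be replaced by a plain reduction over B (a sketch reusing the poster's variable names, assuming B holds doubles):

double s = 0.0;
int j;
#pragma omp simd reduction(+:s)
for (j = 0; j < DailySize; j++)
    s += B[inda + 1 + j];
/* s replaces cblas_ddot(DailySize, &B[inda+1], 1, &ones[inda+1], 1) */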

McCalpinJohn
Honored Contributor III

The Intel compiler is capable of doing an excellent job of code generation for a simple summation loop in C, but there are a few things to look out for.  So start by rewriting the code to compute the sum with an explicit loop.

  1. Vectorization is most often inhibited by the potential for aliasing. 
    1. Use the "-qopt-report=5" compiler option and search the optimization report for messages relating to the new explicit sum loop.
    2. I have not tested the code above, but the use of indirect addressing (i.e., starting the summation at "cuma") will make it harder for the compiler to do a thorough aliasing analysis.  You should be able to force vectorization using "#pragma SIMD" immediately before the summation loop.
    3. I have found that the fastest code often comes from using "#pragma omp parallel for reduction (+:sum)", where "sum" is the name of the variable used for summation; see the sketch after this list.  Even if you only use one thread, the use of the OpenMP pragma seems to allow the compiler to be more aggressive about re-ordering the computations to improve vectorization. 
      1. I have not re-tested this with the most recent (2017) compilers -- it may not be required any more to get best performance, but it is probably still a useful option to test.
      2. I have not tested the "#pragma omp parallel for reduction (+:...)" clause against the "#pragma omp simd reduction(+:..)" clause. 
  2. Alignment is a potential secondary performance issue.
    1. The performance impact of alignment depends fairly strongly on the processor generation, with a general trend toward better performance of unaligned loads and stores over time.
    2. The worst performance problems are with unaligned stores.  The reduction operation only requires loads, so alignment is not likely to be a major issue.
    3. You can expect to see "complaints" about alignment in the optimization report(s) because of the indirect access (i.e., starting the summation at index "cuma"), but these are much less important than messages about vectorization.
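
A sketch pulling these suggestions together (function and variable names are placeholders, not from the posted code; restrict addresses the aliasing concern in 1.1, and the pragma is the one from 1.3):

double sum_reduction(const double * restrict a, int n)
{
    double sum = 0.0;
    /* threaded + vectorized reduction; build with icc -qopenmp (or equivalent) */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}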