topic a large array on the sum of the optimization problem in Intel® oneAPI Math Kernel Library

a large array on the sum of the optimization problem

fyten1985 — Fri, 26 Mar 2010 07:53:12 GMT

This is a question about optimizing the sum of two large arrays.
I have two arrays of type double, and the code for this problem is as follows:
#pragma omp parallel for
for (long i = 0; i < 5000000; i++)
{
    array1[i] += array2[i];
}
My computer is a Dell PowerEdge 2900 III (5U) with two Xeon 5420 processors and 48 GB of memory.
The OS is MS Windows Server 2003 R2 Enterprise x64 Edition SP2.
The C++ compilers are VC++ 2008 and Intel C++ 11.0.061, and the solution platform is x64.
I compiled the program with both VC and IC, and the two results are basically the same.
Then I used a function from Intel MKL 10.1 to do the computation, as follows:
cblas_daxpy(5000000, 1, array2, 1, array1, 1);
The performance of the program was no different.
Then I used another function from Intel MKL 10.1:
vdAdd( n, a, b, y );
Program performance decreased significantly, to only about 80% of the original.
I would like to know how to optimize this problem and improve the program's performance.

Gennady_F_Intel — Mon, 29 Mar 2010 09:46:16 GMT

I reproduced the result. It seems to me that VML is optimized much better for shorter vector lengths than for vectors this long.

For example, for N = 100000, vdAdd / cblas_daxpy ~ 0.4 (Core 2 Duo), but for N = 10^6, vdAdd / cblas_daxpy ~ 1.3.

I will ask the VML expert team to shed light on this problem.

--Gennady
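
For anyone who wants to repeat this kind of comparison at different N, here is a minimal timing sketch. It assumes the ratios quoted above are ratios of run time, which the post does not state. Because MKL may not be installed, add_loop and axpy_loop are hypothetical plain-loop stand-ins for vdAdd and cblas_daxpy, and seconds_per_call and compare_kernels are names made up for this sketch.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for vdAdd(n, a, b, y): y[i] = a[i] + b[i].
void add_loop(std::size_t n, const double* a, const double* b, double* y) {
    for (std::size_t i = 0; i < n; ++i) y[i] = a[i] + b[i];
}

// Hypothetical stand-in for cblas_daxpy(n, 1, x, 1, y, 1): y[i] += x[i].
void axpy_loop(std::size_t n, const double* x, double* y) {
    for (std::size_t i = 0; i < n; ++i) y[i] += x[i];
}

// Average wall-clock seconds per call, after one untimed warm-up call.
template <class F>
double seconds_per_call(F&& kernel, int reps) {
    kernel();  // warm-up: touches the memory once before timing starts
    const auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r) kernel();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count() / reps;
}

struct Timing {
    double add_seconds;   // per-call time of the out-of-place add
    double axpy_seconds;  // per-call time of the in-place axpy
    double checksum;      // read back from the arrays so the work is observable
};

// Times both kernels on vectors of length n (n must be greater than 0).
Timing compare_kernels(std::size_t n, int reps) {
    std::vector<double> a(n, 1.0), b(n, 2.0), y(n, 0.0);
    Timing t{};
    t.add_seconds = seconds_per_call(
        [&] { add_loop(n, a.data(), b.data(), y.data()); }, reps);
    t.axpy_seconds = seconds_per_call(
        [&] { axpy_loop(n, b.data(), a.data()); }, reps);
    t.checksum = y[n - 1] + a[n - 1];
    return t;
}
```

Calling compare_kernels(100000, 50) and compare_kernels(1000000, 50) and dividing add_seconds by axpy_seconds gives a ratio of the same kind as the ones above. To compare the real routines, replace the two loop bodies with the MKL calls.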

Nikita_A_Intel — Mon, 29 Mar 2010 11:15:25 GMT

Vector sizes of 1000 to 10000 elements are the typical VML usage model (the data should fit in cache). In that case vdAdd runs faster than, or close to, BLAS or a compiler-generated loop, because BLAS and the compiler loop pay thread-creation overhead. vdAdd doesn't use threading (this is a known limitation and we are working on it), so it cannot compete for large vector lengths in a multithreaded environment. Moreover, in your case vdAdd suffers more from cache misses at large vector lengths than BLAS or the compiler loop does, because your test case uses a separate memory array y for the results. You should see better performance if you write vdAdd( n, a, b, a ).
Thanks,
Nikita
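
One way to combine the two points above (keep each call within the 1000 to 10000 element range, and write the result in place) is to split the large array into blocks and let OpenMP threads each handle whole blocks. The following is only a sketch under assumptions, not something taken from MKL: add_block is a hypothetical stand-in for a non-threaded kernel such as vdAdd( n, a, b, a ), blocked_add_in_place is a name made up here, and the default block size of 4096 is an untuned guess.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a non-threaded vector kernel such as vdAdd.
// Computes y[i] = a[i] + b[i]; y may point to the same memory as a.
static void add_block(std::size_t n, const double* a, const double* b, double* y) {
    for (std::size_t i = 0; i < n; ++i) y[i] = a[i] + b[i];
}

// Adds b into a in place, one block at a time. Blocks do not overlap,
// so threads never write to the same elements.
void blocked_add_in_place(std::vector<double>& a, const std::vector<double>& b,
                          std::size_t block = 4096) {
    assert(block > 0 && a.size() == b.size());
    const long long nblocks =
        static_cast<long long>((a.size() + block - 1) / block);
    // Without OpenMP enabled this pragma is ignored and the loop runs serially.
    #pragma omp parallel for
    for (long long k = 0; k < nblocks; ++k) {
        const std::size_t start = static_cast<std::size_t>(k) * block;
        const std::size_t len = std::min(block, a.size() - start);
        // The last block may be shorter than the others.
        add_block(len, a.data() + start, b.data() + start, a.data() + start);
    }
}
```

With MKL available, the body of add_block could be replaced by the vdAdd call. Whether this actually beats the plain OpenMP loop for 5000000 elements would have to be measured, since an operation this simple is likely limited by memory bandwidth.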