topic Re: MKL sgemv performance using multithreading in Intel® oneAPI Math Kernel Library

MKL sgemv performance using multithreading

pilot117 — Sun, 13 Dec 2009 03:17:35 GMT

Hi,

I did several tests for MKL function "sgemv":

A matrix 20000 by 20000
b vec 20000 by 1
x vec 20000 by 1
hardware: 2 quad core xeon cpus.
KMP_AFFINITY=verbose,compact

When I set the number of threads to 4:

omp_set_num_threads(4);
float alpha=1.0;
float gama=0.0;
int index=1;
int N=20000;
sgemv("N", &N, &N, &alpha, A, &N, x, &index, &gama, y, &index); // takes ~92 ms
sgemv("T", &N, &N, &alpha, A, &N, x, &index, &gama, y, &index); // takes ~92 ms

But if I set the number of threads to 8:
omp_set_num_threads(8);
sgemv("N", &N, &N, &alpha, A, &N, x, &index, &gama, y, &index); // takes ~100 ms
sgemv("T", &N, &N, &alpha, A, &N, x, &index, &gama, y, &index); // takes ~100 ms

Increasing the number of threads gives no improvement; it is actually slightly worse. Why?

When I set KMP_AFFINITY=verbose,scatter:
for 4 threads, sgemv("N", ...) takes around 98~104 ms
sgemv("T", ...) takes around 96~100 ms

for 8 threads, sgemv("N", ...) takes around 102~106 ms
sgemv("T", ...) takes around 99 ms
There seems to be no difference between 4 and 8 threads.

So do these times result from caching? Could anyone explain it a little?

many thanks!

Re: MKL sgemv performance using multithreading

barragan_villanueva_ — Wed, 16 Dec 2009 11:28:30 GMT

Quoting - pilot117

There seems to be no difference between 4 and 8 threads.

So do these times result from caching? Could anyone explain it a little?

Hi,

You are asking about performance scalability. Ideally, performance would scale linearly with the number of CPUs. In practice, the parallelization algorithm used, the threading overhead, and the way memory is distributed across caches and nodes (e.g. ccNUMA) make scalability hard to predict. It is therefore possible to reach peak performance at some number of threads, beyond which adding more threads only degrades performance.

Also, the performance measuring technique is important. Please take a look at the modified sgemv example and my results below:

#include <stdlib.h> /* atoi */
#include <mkl.h>    /* sgemv, MKL_Set_Num_Threads */

#ifndef SIZE
#define SIZE 20000
#endif

#ifndef NT
#define NT 4
#endif

#ifndef CYCLE
#define CYCLE 10
#endif

float A[SIZE][SIZE];
float x[SIZE];
float y[SIZE];
float alpha = 1.0;
float gamma = 0.0;
int index = 1;
int N = SIZE;

int main(int argc, char *argv[]) {
    int i;
    int nt;

    /* Thread count: last command-line argument, or NT if none is given. */
    if (argc == 1)
        nt = NT;
    else
        nt = atoi(argv[argc-1]);

    MKL_Set_Num_Threads(nt);

    for (i = 0; i < CYCLE; ++i) // repeated so that later calls run on warm caches
        sgemv("N", &N, &N, &alpha, &A[0][0], &N, x, &index, &gamma, y, &index);
    return 0;
}

On Linux, I get the following results with KMP_AFFINITY=compact:

% /usr/bin/time ./a.out 1
0.67user 0.15system 0:01.16elapsed 71%CPU

% /usr/bin/time ./a.out 2
0.69user 0.15system 0:00.64elapsed 132%CPU

% /usr/bin/time ./a.out 3
0.81user 0.15system 0:00.42elapsed 230%CPU

% /usr/bin/time ./a.out 4
0.68user 0.20system 0:00.42elapsed 212%CPU

% /usr/bin/time ./a.out 5
0.96user 0.20system 0:00.42elapsed 277%CPU

% /usr/bin/time ./a.out 6
1.04user 0.23system 0:00.42elapsed 300%CPU

% /usr/bin/time ./a.out 7
0.97user 0.51system 0:00.42elapsed 351%CPU

% /usr/bin/time ./a.out 8
0.95user 0.62system 0:00.42elapsed 375%CPU

% /usr/bin/time ./a.out 9
1.20user 0.26system 0:00.42elapsed 349%CPU

Let me know if there are any questions about scalability.

Thanks
--Victor

Re: MKL sgemv performance using multithreading

pilot117 — Thu, 17 Dec 2009 04:38:08 GMT

Many thanks for the helpful information, Victor!

Quoting - Victor Pasko (Intel)

Re: MKL sgemv performance using multithreading

Gennady_F_Intel — Fri, 18 Dec 2009 07:37:35 GMT

Victor, what CPU type are you running on?
Is it 64-bit code?
And I guess you used the latest version of MKL?
--Gennady

Re: MKL sgemv performance using multithreading

barragan_villanueva_ — Fri, 18 Dec 2009 07:51:15 GMT

Quoting - Gennady Fedorov (Intel)

Victor, what CPU type are you running on?
Is it 64-bit code?
And I guess you used the latest version of MKL?
--Gennady

Yes, Gennady. I run on Linux 64-bit using the latest MKL.