Re: MKL sgemv performance using multithreading

pilot117 · ‎12-12-2009

Hi,

I did several tests for MKL function "sgemv":

A matrix 20000 by 20000
b vec 20000 by 1
x vec 20000 by 1
hardware: 2 quad core xeon cpus.
KMP_AFFINITY=verbose,compact

With the number of threads set to 4:

omp_set_num_threads(4);
float alpha=1.0;
float gama=0.0;
int index=1;
int N=20000;
sgemv("N", &N, &N, &alpha, A, &N, x, &index, &gama, y, &index); // takes ~92 ms
sgemv("T", &N, &N, &alpha, A, &N, x, &index, &gama, y, &index); // takes ~92 ms

But if I set the number of threads to 8:
omp_set_num_threads(8);
sgemv("N", &N, &N, &alpha, A, &N, x, &index, &gama, y, &index); // takes ~100 ms
sgemv("T", &N, &N, &alpha, A, &N, x, &index, &gama, y, &index); // takes ~100 ms

There is NO improvement from increasing the number of threads; it is actually slower! Why?

With KMP_AFFINITY=verbose,scatter:
for 4 threads, sgemv("N", ...) takes around 98~104 ms
and sgemv("T", ...) takes around 96~100 ms;

for 8 threads, sgemv("N", ...) takes around 102~106 ms
and sgemv("T", ...) takes around 99 ms.
There seems to be no difference between 4 and 8 threads.

So do these timings result from caching? Could anyone explain it a little bit?

many thanks!

barragan_villanueva_ · ‎12-16-2009

Quoting - pilot117

Seems no difference between 4 and 8 threads.

So are these times are resulted from caching? Any one could explain it a little bit?

Hi,

You are asking about performance scalability. Ideally, performance would scale linearly with the number of CPUs. In practice, the parallelization algorithm used, the threading overhead, and the way memory is distributed across caches (e.g. on ccNUMA systems) make scalability difficult to predict. As a result, peak performance may be reached at some particular number of threads, and adding more threads beyond that point only degrades performance.
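One way to put the reported timings in perspective is to estimate how much memory traffic a single call implies. The sketch below is only a back-of-the-envelope calculation: it assumes that each element of A is read from memory exactly once per call and that the vectors are negligible, and the function names are illustrative, not part of MKL.

```c
#include <stddef.h>

/* Bytes occupied by an n-by-n single-precision matrix.
   Assumption: sgemv streams every element of A once per call. */
double matrix_bytes(size_t n) {
    return (double)n * (double)n * (double)sizeof(float);
}

/* Effective memory bandwidth in GB/s (1 GB = 1e9 bytes)
   implied by one call that takes `seconds`. */
double effective_gbps(size_t n, double seconds) {
    return matrix_bytes(n) / seconds / 1e9;
}
```

For N = 20000 the matrix occupies 1.6 GB, so a 92 ms call corresponds to roughly 17.4 GB/s and a 100 ms call to 16 GB/s under these assumptions. If that is close to what the memory system can deliver, extra threads would have little left to gain; the thread does not state the machine's peak bandwidth, so this remains a hypothesis to check.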

Also, the performance measuring technique is important. Please take a look at the modified sgemv example and my results below:

#include <stdlib.h> /* atoi */
#include <mkl.h>    /* sgemv, MKL_Set_Num_Threads */

#ifndef SIZE
#define SIZE 20000
#endif

#ifndef NT
#define NT 4
#endif

#ifndef CYCLE
#define CYCLE 10
#endif

float A[SIZE][SIZE];
float x[SIZE];
float y[SIZE];
float alpha = 1.0;
float gamma = 0.0;
int index = 1;
int N = SIZE;

int main(int argc, char *argv[]) {
    int i;
    int nt;

    /* thread count: last command-line argument, or NT by default */
    if (argc == 1)
        nt = NT;
    else
        nt = atoi(argv[argc-1]);

    MKL_Set_Num_Threads(nt);

    for (i = 0; i < CYCLE; ++i) // repeated so that later calls run on warm caches
        sgemv("N", &N, &N, &alpha, &A[0][0], &N, x, &index, &gamma, y, &index);
    return 0;
}
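Timing the whole process also counts start-up and the first touch of the matrix. A minimal sketch of timing only the repeated calls, after one untimed warm-up call, might look like the following. To stay self-contained it uses a plain C matrix-vector loop as a stand-in for sgemv and the POSIX clock_gettime timer; the names now_seconds, matvec, and time_matvec are hypothetical helpers, not MKL functions.

```c
#define _POSIX_C_SOURCE 199309L /* for clock_gettime */
#include <stdlib.h>
#include <time.h>

static volatile float sink; /* keeps the compiler from discarding the results */

/* Wall-clock time in seconds from a monotonic clock. */
static double now_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

/* Stand-in for the sgemv call: y = A*x for a row-major n-by-n matrix. */
static void matvec(int n, const float *a, const float *x, float *y) {
    for (int i = 0; i < n; ++i) {
        float sum = 0.0f;
        for (int j = 0; j < n; ++j)
            sum += a[(size_t)i * n + j] * x[j];
        y[i] = sum;
    }
}

/* Average seconds per call over `cycles` timed calls, after one
   untimed warm-up call. Returns -1.0 if allocation fails. */
double time_matvec(int n, int cycles) {
    float *a = calloc((size_t)n * n, sizeof *a);
    float *x = calloc((size_t)n, sizeof *x);
    float *y = calloc((size_t)n, sizeof *y);
    double result = -1.0;
    if (a && x && y) {
        matvec(n, a, x, y); /* warm-up: touches all pages, fills caches */
        double t0 = now_seconds();
        for (int c = 0; c < cycles; ++c) {
            matvec(n, a, x, y);
            sink = y[0];
        }
        result = (now_seconds() - t0) / cycles;
    }
    free(a);
    free(x);
    free(y);
    return result;
}
```

Calling time_matvec(SIZE, CYCLE) from main gives an average time per call. In the MKL program above, the same pattern would wrap the sgemv loop between two timer reads.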

On Linux, I get the following results with KMP_AFFINITY=compact:

% /usr/bin/time ./a.out 1
0.67user 0.15system 0:01.16elapsed 71%CPU

% /usr/bin/time ./a.out 2
0.69user 0.15system 0:00.64elapsed 132%CPU

% /usr/bin/time ./a.out 3
0.81user 0.15system 0:00.42elapsed 230%CPU

% /usr/bin/time ./a.out 4
0.68user 0.20system 0:00.42elapsed 212%CPU

% /usr/bin/time ./a.out 5
0.96user 0.20system 0:00.42elapsed 277%CPU

% /usr/bin/time ./a.out 6
1.04user 0.23system 0:00.42elapsed 300%CPU

% /usr/bin/time ./a.out 7
0.97user 0.51system 0:00.42elapsed 351%CPU

% /usr/bin/time ./a.out 8
0.95user 0.62system 0:00.42elapsed 375%CPU

% /usr/bin/time ./a.out 9
1.20user 0.26system 0:00.42elapsed 349%CPU
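The elapsed times above can be turned into speedup and parallel efficiency figures, which make the plateau easier to see. The following is a small sketch that uses only the numbers reported in this post; the function names are illustrative.

```c
#include <stdio.h>

/* Speedup of an n-thread run relative to the 1-thread run. */
double speedup(double t1, double tn) {
    return t1 / tn;
}

/* Parallel efficiency: speedup divided by thread count (1.0 = linear scaling). */
double efficiency(double t1, double tn, int threads) {
    return speedup(t1, tn) / threads;
}

/* Print speedup and efficiency for the elapsed times reported above
   (seconds, for 1 to 9 threads). */
void print_scaling(void) {
    static const double elapsed[] = {1.16, 0.64, 0.42, 0.42, 0.42,
                                     0.42, 0.42, 0.42, 0.42};
    const int n = (int)(sizeof elapsed / sizeof elapsed[0]);
    for (int i = 0; i < n; ++i)
        printf("%d threads: speedup %.2f, efficiency %.2f\n", i + 1,
               speedup(elapsed[0], elapsed[i]),
               efficiency(elapsed[0], elapsed[i], i + 1));
}
```

With these numbers the speedup is about 1.81 on 2 threads and levels off at about 2.76 from 3 threads onward, so efficiency falls steadily as more threads are added.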

Let me know if there are any questions about scalability.

Thanks
--Victor

pilot117 · ‎12-16-2009

many thanks for your nice information, victor!


Gennady_F_Intel · ‎12-17-2009

Victor, what CPU type are you running on?
Is it 64-bit code?
And I guess you used the latest version of MKL?
--Gennady

barragan_villanueva_ · ‎12-17-2009

Quoting - Gennady Fedorov (Intel)

Victor, what CPU type are you running on?
Is it 64-bit code?
And I guess you used the latest version of MKL?
--Gennady

Yes, Gennady. I run on Linux 64-bit using the latest MKL.