Intel® oneAPI Math Kernel Library

MKL sgemv performance using multithreading

pilot117
Beginner
Hi,

I did several tests for MKL function "sgemv":

A matrix 20000 by 20000
b vec 20000 by 1
x vec 20000 by 1
hardware: 2 quad core xeon cpus.
KMP_AFFINITY=verbose,compact

When I set the number of threads to 4:

omp_set_num_threads(4);
float alpha=1.0;
float gama=0.0;
int index=1;
int N=20000;
sgemv("N", &N, &N, &alpha, A, &N, x, &index, &gama, y, &index); // takes ~92 ms
sgemv("T", &N, &N, &alpha, A, &N, x, &index, &gama, y, &index); // takes ~92 ms

But if I set the number of threads to 8:
omp_set_num_threads(8);
sgemv("N", &N, &N, &alpha, A, &N, x, &index, &gama, y, &index); // takes ~100 ms
sgemv("T", &N, &N, &alpha, A, &N, x, &index, &gama, y, &index); // takes ~100 ms

Increasing the number of threads gives no improvement; it is actually slightly worse. Why?


When I set KMP_AFFINITY=verbose,scatter:
for 4 threads sgemv("N"....) takes around 98~104ms
sgemv("T"....) takes around 96~100ms

for 8 threads
sgemv("N"....) takes around 102~106ms
sgemv("T"....) takes around 99ms

There seems to be no difference between 4 and 8 threads.

So are these timings a result of caching? Could anyone explain this a little?

Many thanks!

barragan_villanueva_
Valued Contributor I
Quoting - pilot117
There seems to be no difference between 4 and 8 threads.

So are these timings a result of caching? Could anyone explain this a little?

Hi,

You are asking about performance scalability. Ideally, performance would scale linearly with the number of CPUs. In practice, once you take into account the parallelization algorithm used, the threading overhead, and how memory is distributed across caches (e.g. ccNUMA), scalability is generally hard to predict. It is therefore possible to reach peak performance at some number of threads, beyond which adding more threads only degrades performance.
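
One way to see why extra threads may not help here: a matrix-vector product reads every matrix element once and performs only one multiply and one add on it, so kernels of this kind are commonly limited by memory bandwidth rather than by the number of cores. The sketch below converts a measured time per call into an effective bandwidth and a floating-point rate. The helper names are hypothetical and are not part of MKL, and the traffic for the x and y vectors is ignored because it is negligible next to the matrix.

```c
#include <stddef.h>

/* Bytes of matrix data that one single-precision n x n GEMV has to read. */
static double gemv_matrix_bytes(size_t n) {
    return (double)n * (double)n * (double)sizeof(float);
}

/* Effective bandwidth in GB/s (1 GB = 1e9 bytes) for one call that took
 * `seconds`. Only the matrix traffic is counted (an assumption). */
static double effective_bandwidth_gbs(size_t n, double seconds) {
    return gemv_matrix_bytes(n) / seconds / 1e9;
}

/* GFLOP/s for one call, counting one multiply and one add per element. */
static double gemv_gflops(size_t n, double seconds) {
    return 2.0 * (double)n * (double)n / seconds / 1e9;
}
```

With the figures from the question, N = 20000 at about 92 ms per call works out to roughly 17.4 GB/s and 8.7 GFLOP/s, and about 100 ms per call works out to 16 GB/s. If that is close to what the memory system of the machine can sustain (this has not been measured in this thread), additional threads would have no bandwidth left to use.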

Also, the performance measurement technique is important. Please take a look at the modified sgemv example and my results below:

#include <stdlib.h>   /* atoi */
#include <mkl.h>      /* sgemv, MKL_Set_Num_Threads */

#ifndef SIZE
#define SIZE 20000
#endif

#ifndef NT
#define NT 4
#endif

#ifndef CYCLE
#define CYCLE 10
#endif

float A[SIZE][SIZE];
float x[SIZE];
float y[SIZE];
float alpha = 1.0;
float gamma = 0.0;
int index = 1;
int N = SIZE;

int main(int argc, char *argv[]) {
    int i;
    int nt;

    /* Thread count: last command-line argument, or NT if none is given. */
    if (argc == 1)
        nt = NT;
    else
        nt = atoi(argv[argc-1]);

    MKL_Set_Num_Threads(nt);

    for (i = 0; i < CYCLE; ++i) // repeated so that later calls run on warm caches
        sgemv("N", &N, &N, &alpha, &A[0][0], &N, x, &index, &gamma, y, &index);
    return 0;
}

On Linux, I get the following results with KMP_AFFINITY=compact:

% /usr/bin/time ./a.out 1
0.67user 0.15system 0:01.16elapsed 71%CPU

% /usr/bin/time ./a.out 2
0.69user 0.15system 0:00.64elapsed 132%CPU

% /usr/bin/time ./a.out 3
0.81user 0.15system 0:00.42elapsed 230%CPU

% /usr/bin/time ./a.out 4
0.68user 0.20system 0:00.42elapsed 212%CPU

% /usr/bin/time ./a.out 5
0.96user 0.20system 0:00.42elapsed 277%CPU

% /usr/bin/time ./a.out 6
1.04user 0.23system 0:00.42elapsed 300%CPU

% /usr/bin/time ./a.out 7
0.97user 0.51system 0:00.42elapsed 351%CPU

% /usr/bin/time ./a.out 8
0.95user 0.62system 0:00.42elapsed 375%CPU

% /usr/bin/time ./a.out 9
1.20user 0.26system 0:00.42elapsed 349%CPU
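
To read the elapsed times above as scalability figures, one can compute the speedup relative to the single-thread run, the parallel efficiency, and the thread count at which the time stops improving. This is a minimal sketch; the function names and the tolerance parameter are assumptions chosen for illustration.

```c
#include <stddef.h>

/* Speedup of a multi-threaded run relative to the single-thread run. */
static double speedup(double t1, double tn) {
    return t1 / tn;
}

/* Parallel efficiency: speedup divided by the number of threads used. */
static double efficiency(double t1, double tn, int threads) {
    return speedup(t1, tn) / (double)threads;
}

/* elapsed[i] is the elapsed time measured with i + 1 threads.
 * Returns the smallest thread count whose time is within the fraction
 * `tol` of the best time in the array, or 0 if the array is empty. */
static int saturation_point(const double *elapsed, size_t count, double tol) {
    if (count == 0)
        return 0;
    double best = elapsed[0];
    for (size_t i = 1; i < count; ++i)
        if (elapsed[i] < best)
            best = elapsed[i];
    for (size_t i = 0; i < count; ++i)
        if (elapsed[i] <= best * (1.0 + tol))
            return (int)(i + 1);
    return (int)count;
}
```

Applied to the elapsed times listed above (1.16 s, 0.64 s, then 0.42 s from 3 threads onward), this gives a speedup of about 1.81 with 2 threads and about 2.76 with 3 threads (roughly 92% efficiency). With 8 threads the speedup is still about 2.76, so efficiency drops to roughly 35%, and the saturation point with a 5% tolerance is 3 threads.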

Let me know if you have any questions about scalability.

Thanks
--Victor
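
Since the measurement technique matters, an alternative to timing the whole process with /usr/bin/time is to time individual calls inside the program, after a few untimed warm-up calls, using a wall-clock timer (CPU-time functions such as clock() add up the time of all threads and are misleading for threaded code). The sketch below assumes a POSIX system with clock_gettime. The callback type, the function names, and the plain-C naive_gemv stand-in are hypothetical; in a real test the callback would wrap the MKL sgemv call instead.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* exposes clock_gettime under strict -std=c99 */
#endif
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: one invocation of the kernel under test. */
typedef void (*kernel_fn)(void *ctx);

/* Wall-clock time in seconds, read from a monotonic clock. */
static double wall_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

/* Runs `warmup` untimed calls, then `reps` timed calls, and returns the
 * fastest single call in seconds, or -1.0 if reps <= 0. */
static double best_time(kernel_fn kernel, void *ctx, int warmup, int reps) {
    double best = -1.0;
    for (int i = 0; i < warmup; ++i)
        kernel(ctx);
    for (int i = 0; i < reps; ++i) {
        double t0 = wall_seconds();
        kernel(ctx);
        double dt = wall_seconds() - t0;
        if (best < 0.0 || dt < best)
            best = dt;
    }
    return best;
}

/* Arguments for the stand-in kernel: row-major n x n matrix a, vectors x, y. */
struct gemv_args {
    int n;
    const float *a;
    const float *x;
    float *y;
};

/* Plain-C stand-in computing y = A * x; replace with the real sgemv call. */
static void naive_gemv(void *ctx) {
    struct gemv_args *g = ctx;
    for (int i = 0; i < g->n; ++i) {
        float acc = 0.0f;
        for (int j = 0; j < g->n; ++j)
            acc += g->a[(size_t)i * (size_t)g->n + (size_t)j] * g->x[j];
        g->y[i] = acc;
    }
}
```

Calling best_time(naive_gemv, &args, 2, 10) would run two warm-up calls and report the fastest of ten timed calls. Reporting the minimum is one common choice because it is the least affected by interference from other processes; the mean or median may be more appropriate when typical rather than best-case behavior is of interest.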

pilot117
Beginner
Many thanks for the helpful information, Victor!

Quoting - Victor Pasko (Intel)

Gennady_F_Intel
Moderator
Victor, what CPU type are you running on?
Is it 64-bit code?
And I guess you used the latest version of MKL?
--Gennady

barragan_villanueva_
Valued Contributor I
Yes, Gennady. I run on Linux 64-bit using the latest MKL.