I ran several tests of the MKL function "sgemv":
A matrix: 20000 by 20000
b vector: 20000 by 1
x vector: 20000 by 1
Hardware: 2 quad-core Xeon CPUs.
KMP_AFFINITY=verbose,compact
With the number of threads set to 4:
omp_set_num_threads(4);
float alpha=1.0;
float gama=0.0;
int index=1;
int N=20000;
sgemv("N", &N, &N, &alpha, A, &N, x, &index, &gama, y, &index); // takes ~92 ms
sgemv("T", &N, &N, &alpha, A, &N, x, &index, &gama, y, &index); // takes ~92 ms
But if I set the number of threads to 8:
omp_set_num_threads(8);
sgemv("N", &N, &N, &alpha, A, &N, x, &index, &gama, y, &index); // takes ~100 ms
sgemv("T", &N, &N, &alpha, A, &N, x, &index, &gama, y, &index); // takes ~100 ms
Increasing the number of threads gives no improvement; it is actually slower. Why?
With KMP_AFFINITY=verbose,scatter:
for 4 threads, sgemv("N", ...) takes around 98~104 ms
sgemv("T", ...) takes around 96~100 ms
for 8 threads, sgemv("N", ...) takes around 102~106 ms
sgemv("T", ...) takes around 99 ms
There seems to be no difference between 4 and 8 threads.
Do these timings result from caching? Could anyone explain this a little?
Many thanks!
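Per-call figures like the ones above depend on how each call is timed, which the post does not show. The following is a minimal sketch of one common approach, assuming a C11 compiler: run a few untimed warm-up calls, then average wall-clock time over several repetitions. The names `wall_seconds`, `time_avg_ms`, and `count_call` are hypothetical helpers for illustration, not part of MKL; in a real test the callback would wrap the `sgemv` call.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical helper: current wall-clock time in seconds (C11 timespec_get). */
double wall_seconds(void) {
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Hypothetical helper: average wall-clock milliseconds per call of fn(arg).
   `warmup` untimed calls run first so that one-time costs (thread start-up,
   first touch of memory) are kept out of the measurement. */
double time_avg_ms(void (*fn)(void *), void *arg, int warmup, int reps) {
    int i;
    double t0, t1;
    assert(reps > 0);
    for (i = 0; i < warmup; ++i)
        fn(arg);
    t0 = wall_seconds();
    for (i = 0; i < reps; ++i)
        fn(arg);
    t1 = wall_seconds();
    return (t1 - t0) * 1000.0 / reps;
}

/* Stand-in workload for illustration: counts how often it was called. */
void count_call(void *arg) {
    ++*(int *)arg;
}
```

Wall-clock time is used rather than CPU time because CPU time is summed over all threads and therefore grows with the thread count even when the call does not get faster.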
Do these timings result from caching? Could anyone explain this a little?
Hi,
You are asking about performance scalability. Ideally, performance would scale linearly with the number of CPUs. In practice, the parallelization algorithm used, the threading overhead, and the way memory is distributed across caches and nodes (e.g. ccNUMA) make scalability hard to predict. It is therefore possible to reach peak performance at some number of threads, beyond which adding threads only degrades performance.
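One way to check whether memory traffic is the limit: a matrix-vector product uses every matrix element exactly once, so the run time cannot be lower than the time needed to stream the matrix from memory. The sketch below is rough arithmetic, not something taken from the thread, and the helper names are hypothetical. It converts a measured time into effective memory traffic, which can then be compared with what the machine's memory system can sustain. Whether that limit is actually reached on the hardware in question is an assumption to verify.

```c
#include <assert.h>

/* Hypothetical helper: bytes of matrix data read by one n-by-n
   single-precision matrix-vector product (vectors are negligible). */
double matrix_bytes(double n) {
    return n * n * (double)sizeof(float);
}

/* Hypothetical helper: effective bandwidth in GB/s (1 GB = 1e9 bytes). */
double effective_gbps(double bytes, double seconds) {
    assert(seconds > 0.0);
    return bytes / seconds / 1e9;
}
```

For N = 20000 and a 4-byte float the matrix occupies 1.6e9 bytes, so the ~92 ms reported in the question corresponds to roughly 17 GB/s of matrix traffic.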
Also, the performance measuring technique is important. Please take a look at the modified sgemv example and my results below:
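#include <stdlib.h> /* atoi */
#include <mkl.h>    /* sgemv, MKL_Set_Num_Threads */
#ifndef SIZE
#define SIZE 20000
#endif
#ifndef NT
#define NT 4
#endif
#ifndef CYCLE
#define CYCLE 10
#endif
/* Static storage: the matrix alone is about 1.6 GB, far too large for the stack. */
float A[SIZE][SIZE];
float x[SIZE];
float y[SIZE];
float alpha = 1.0;
float gamma = 0.0;
int index = 1;
int N = SIZE;
int main(int argc, char *argv[]) {
    int i;
    int nt;
    /* Thread count: last command-line argument, or NT if none is given. */
    if (argc == 1)
        nt = NT;
    else
        nt = atoi(argv[argc-1]);
    MKL_Set_Num_Threads(nt);
    for (i = 0; i < CYCLE; ++i) // repeated so that later iterations run on warm caches
        sgemv("N", &N, &N, &alpha, &A[0][0], &N, x, &index, &gamma, y, &index);
    return 0;
}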
On Linux, I get the following results with KMP_AFFINITY=compact:
% /usr/bin/time ./a.out 1
0.67user 0.15system 0:01.16elapsed 71%CPU
% /usr/bin/time ./a.out 2
0.69user 0.15system 0:00.64elapsed 132%CPU
% /usr/bin/time ./a.out 3
0.81user 0.15system 0:00.42elapsed 230%CPU
% /usr/bin/time ./a.out 4
0.68user 0.20system 0:00.42elapsed 212%CPU
% /usr/bin/time ./a.out 5
0.96user 0.20system 0:00.42elapsed 277%CPU
% /usr/bin/time ./a.out 6
1.04user 0.23system 0:00.42elapsed 300%CPU
% /usr/bin/time ./a.out 7
0.97user 0.51system 0:00.42elapsed 351%CPU
% /usr/bin/time ./a.out 8
0.95user 0.62system 0:00.42elapsed 375%CPU
% /usr/bin/time ./a.out 9
1.20user 0.26system 0:00.42elapsed 349%CPU
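To read the elapsed times above as scalability figures, they can be turned into speedup and parallel efficiency relative to the single-thread run. This is a small sketch with hypothetical helper names; note that the elapsed times cover the whole process (start-up plus all CYCLE iterations), so the ratios are only approximate.

```c
#include <assert.h>

/* Hypothetical helper: speedup of an n-thread run over the 1-thread run. */
double speedup(double t1, double tn) {
    assert(tn > 0.0);
    return t1 / tn;
}

/* Hypothetical helper: parallel efficiency, i.e. speedup per thread
   (1.0 would be perfect linear scaling). */
double efficiency(double t1, double tn, int threads) {
    assert(threads > 0);
    return speedup(t1, tn) / threads;
}
```

With the numbers above, the speedup is 1.16 / 0.64 ≈ 1.81 at 2 threads and 1.16 / 0.42 ≈ 2.76 from 3 threads onward, so efficiency falls from about 0.91 at 2 threads to about 0.35 at 8 threads.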
Let me know if there are any questions about scalability.
Thanks
--Victor
Is it 64-bit code?
And I guess you used the latest version of MKL?
--Gennady
Is it 64-bit code?
And I guess you used the latest version of MKL?
--Gennady
Yes, Gennady. I run on Linux 64-bit using the latest MKL.