Hi, I ran into an odd problem: a calculation written as a plain for loop turned out to be more efficient than using the VM (vector math) library of MKL. I tested several examples and they all show the same kind of result.
For example, on an Intel(R) Xeon(R) Gold 6148 CPU with Intel Parallel Studio 2018u5, running the test binary gives:
./test
Time for normal distribution
serial: 5s
vector (HA): 12s
vector (LA): 11s
vector (EP): 12s
The compile flags are:
-O3 -xHost -DMKL_ILP64 -I${MKLROOT}/include
and the link line is:
-Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_ilp64.a ${MKLROOT}/lib/intel64/libmkl_sequential.a ${MKLROOT}/lib/intel64/libmkl_core.a -Wl,--end-group -lpthread -lm -ldl
I also used VTune to profile the run. It shows that the time consumed by vdExp is almost equal to that of the serial for loop. The source code and Makefile are in the attached tar package. The OS is CentOS 7.3.
I don't understand why the VM library runs so slowly. Is there anything wrong with the flags or the code?
What is the problem size? Please check with the latest MKL 2019 u1.
>> a plain for loop is more efficient than using the VM library of MKL.
Actually, in the case of loops, the SVML implementation of these functions is called, and for short and medium problem sizes we expect SVML performance to be better.
Thanks for your reply.
The size is big, I think. There are two nested loops:
for (i = 0; i < 30000; i++)
    for (j = 0; j < 100000; j++)
So the total is 30000 x 100000 = 3e9 evaluations.
The only case where the for loop code runs slower than the SVML version is when the code is compiled with gcc and linked against MKL.
The VML version of your code is memory bound, which is why you see no difference in performance for HA/LA/EP modes. You would need to rearrange the VML version of the code to be cache-blocked to have better performance -- try tests with smaller values of nx to see how this matters. For your inner loops to fit in L2 cache, you should limit nx to something like 10000 or less.
The serial version of the code has much better memory utilization because it makes just one pass through the x and y_serial vectors for each pass of the outer loop. The VML version also has to write and read the three temporary vectors.