Hi All,
I would appreciate your help with this basic question.
I found the following nested loop in a paper ("Model-Driven SIMD Code Generation for a Multi-Resolution Tensor Kernel"):
for (int i = 0; i < LEN2; i++) {
    for (int k = 0; k < LEN2; k++) {
        for (int j = 0; j < LEN2; j++) {
            cc
        }
    }
}
I modified it as follows:
for (int i = 0; i < LEN2; i++) {
    for (int k = 0; k < LEN2; k++) {
        t = bb
        for (int j = 0; j < LEN2; j++) {
            cc
        }
    }
}
I used ICC to test these two pieces of code, checking the execution time and the numbers of scalar and vector operations.
- The CPU is Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
- The ICC version is icc (ICC) 12.1.5 20120612.
- OS: GNU/Linux
The output is:
Loop            Time(Sec)     Checksum
tensor_kernel   2.25          0.210001
PAPI_FP_OPS     8
PAPI_VEC_SP     26369718676
tensor_kernel   20.55         0.210001
PAPI_FP_OPS     9
PAPI_VEC_SP     26264735280
The execution times are quite different: one run takes about 2 seconds, the other about 20 seconds.
On the other hand, the PAPI results show that the counts of scalar and vector operations are similar. To confirm the PAPI results, I checked the generated .s files and did not find any obvious difference. I don't understand what causes this difference in execution time.
The code is attached; it uses the TSVC framework. (By default the PAPI version is also built; you can remove it from the list.)
Thank you very much for reading this question.
Best Regards,
Susan
Hi Sergey,
Thank you very much for the info. That's really helpful.
Now I understand that I should be very careful with this modification; it is better used together with tiling (or other transformations) to avoid cache misses.
Best Regards,
Susan
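For readers following along, the tiling mentioned above can be sketched roughly as below. This is only an illustration under stated assumptions: the kernel is taken to be an ordinary row-major product C += A * B (the loop bodies in the original post are not fully shown), and the names matmul_ikj, matmul_tiled and the tile parameter are hypothetical, not from the attached code.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed kernel (not from the post): C += A * B on row-major n x n arrays,
   in i-k-j order with the A element hoisted into a scalar t. */
void matmul_ikj(size_t n, const float *a, const float *b, float *c)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t k = 0; k < n; k++) {
            float t = a[i * n + k];
            for (size_t j = 0; j < n; j++) {
                c[i * n + j] += t * b[k * n + j];
            }
        }
    }
}

/* Same product, tiled over k and j so that a tile x tile block of B is
   reused for every row i before the next block is touched.
   tile is a hypothetical tuning parameter and must be greater than 0. */
void matmul_tiled(size_t n, size_t tile, const float *a, const float *b, float *c)
{
    assert(tile > 0);
    for (size_t kk = 0; kk < n; kk += tile) {
        size_t k_end = (kk + tile < n) ? kk + tile : n;
        for (size_t jj = 0; jj < n; jj += tile) {
            size_t j_end = (jj + tile < n) ? jj + tile : n;
            for (size_t i = 0; i < n; i++) {
                for (size_t k = kk; k < k_end; k++) {
                    float t = a[i * n + k];
                    for (size_t j = jj; j < j_end; j++) {
                        c[i * n + j] += t * b[k * n + j];
                    }
                }
            }
        }
    }
}
```

Both functions accumulate each element of C over k in the same increasing order, so they should give the same result; whether the tiled version is faster depends on the array size, the tile size, and the cache hierarchy, and would need to be measured.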
This may be off the point of this thread. Examining your code, it is a matrix multiplication in which the destination is neither of the sources. It might be worth investigating producing ccT, the transpose of the eventual result, and then, once ccT is fully computed, transposing it to produce cc.
Although this adds extra work for the final transpose, it removes latencies by improving vectorization and cache utilization.
Also, it wouldn't hurt to add a test using MKL.
Jim Dempsey
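A minimal sketch of the ccT idea, for illustration only. The kernel shape used here, cc[j][i] += aa[j][k] * bb[k][i] with j innermost, is an assumption and is not taken from the attached code; the function names and the caller-supplied scratch buffer ccT are likewise hypothetical.

```c
#include <stddef.h>

/* Assumed kernel (not from the post): cc[j][i] += aa[j][k] * bb[k][i].
   With j innermost, each store to cc jumps a whole row (stride n). */
void kernel_direct(size_t n, const float *aa, const float *bb, float *cc)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t k = 0; k < n; k++) {
            float t = bb[k * n + i];
            for (size_t j = 0; j < n; j++) {
                cc[j * n + i] += aa[j * n + k] * t;
            }
        }
    }
}

/* Same arithmetic, accumulated into the transposed scratch array ccT so
   that the inner-loop stores are unit stride; ccT is then transposed
   into cc. The caller must pass ccT as n * n floats, all zero. */
void kernel_via_transpose(size_t n, const float *aa, const float *bb,
                          float *cc, float *ccT)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t k = 0; k < n; k++) {
            float t = bb[k * n + i];
            for (size_t j = 0; j < n; j++) {
                ccT[i * n + j] += aa[j * n + k] * t;
            }
        }
    }
    /* Final transpose: the extra work mentioned above. */
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            cc[j * n + i] += ccT[i * n + j];
        }
    }
}
```

In this sketch the loads from aa are still strided, so only the stores benefit; whether the extra transpose pays off would have to be checked by timing both versions on the real arrays.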