The following code is meant to exercise the FPU. I run it on an E5-2620, but it only reaches about 2 GFlops. If I want to reach 2*8 GFlops, how should I write the program?
Any help will be appreciated.
#include <immintrin.h>

void* test_pd_avx()
{
    /* _mm256_load_pd requires 32-byte-aligned data */
    double x[4] __attribute__((aligned(32))) = {12.02, 14.34, 34.23, 234.34};
    double y[4] __attribute__((aligned(32))) = {123.234, 234.234, 675.34, 3453.345};
    __m256d mx = _mm256_load_pd(x);
    __m256d my = _mm256_load_pd(y);
    for (;;)
    {
        __m256d mz = _mm256_mul_pd(mx, my);
        (void)mz;  /* result intentionally unused; the loop only exercises the multiplier */
    }
    return 0;  /* unreachable */
}
The compile command: icc test.c -O0
GHui,
If you really want to run at top speed, program in assembly rather than intrinsics, so you really know what is happening with register allocation under the covers. To achieve maximum utilization you need to use both the "add" and the "mul" pipes; your code only uses the "mul" unit. You also need to use 256-bit operations, since the units are 256 bits wide. Another point: while some trivial and pointless code might get full utilization while hitting in the L1D, you want this performance even when you are not hitting in the L1D and your data resides in the L2 or in memory. If you focus only on the L1D-hit case, what happens while you are getting that data into the L1D? You are likely not doing any computation during that copy. To keep computing while data is moving:
1) Block your data to arrange its locality within the caches, preferably the L2.
2) Try to reuse loaded data, if the algorithm allows it, to lessen the number of loads per cycle you need to perform. This matters when you are filling data into the L1D while also loading it into the core; you likely cannot fill and service two loads at the same time, due to design constraints of the L1D.
3) Make sure you don't have TLB misses. A TLB miss wastes cycles (7 on SB/IB and Haswell) during which you achieve nothing, because you don't have the physical address needed to fetch the cache line from the cache hierarchy or memory. Monitor your TLB misses per 1000 instructions; if they are an issue, use huge pages or change how you do the computation via loop reordering or data blocking.
4) Determine what the bandwidth constraints on your code are for reaching maximum FPU performance: is it even possible, and does the L1D have enough bandwidth?
You might also use "software prefetch" instructions to pre-emptively fill data that you know you will need. Hardware prefetchers are very useful, but they'll never be able to predict the future of memory accesses the way the algorithm's creator can.
My DGEMM/SGEMM/etc. achieves 96% efficiency on SB/IB (and soon Haswell), and it only requires a GB or two of bandwidth from system memory to do so, so I speak from experience. Focus on one thread; don't worry about parallelization. Solve the simple problem before you complicate it.
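To make the dual-pipe point concrete, here is a minimal sketch of a synthetic peak-FLOPS loop for Sandy Bridge (which has no FMA). It is only an illustration: the function name, constants, iteration count and the choice of four accumulator chains per pipe are arbitrary, and more chains may be needed to fully hide the 5-cycle multiply and 3-cycle add latencies. Note also that at -O0 compilers typically keep __m256d temporaries in memory rather than registers, which by itself limits throughput, so the sketch assumes optimization is enabled (e.g. icc -O2 -xAVX).

#include <immintrin.h>

/* Illustrative sketch: independent 256-bit multiply and add chains so the MUL
 * pipe and the ADD pipe can both issue every cycle. */
double test_pd_avx_mul_add(void)
{
    __m256d m0 = _mm256_set1_pd(1.1), m1 = _mm256_set1_pd(1.2),
            m2 = _mm256_set1_pd(1.3), m3 = _mm256_set1_pd(1.4);
    __m256d a0 = _mm256_set1_pd(0.1), a1 = _mm256_set1_pd(0.2),
            a2 = _mm256_set1_pd(0.3), a3 = _mm256_set1_pd(0.4);
    const __m256d fm = _mm256_set1_pd(1.0000001);   /* multiply factor */
    const __m256d fa = _mm256_set1_pd(0.0000001);   /* add increment   */

    for (long i = 0; i < 100000000L; i++)
    {
        m0 = _mm256_mul_pd(m0, fm);  m1 = _mm256_mul_pd(m1, fm);   /* MUL pipe */
        m2 = _mm256_mul_pd(m2, fm);  m3 = _mm256_mul_pd(m3, fm);
        a0 = _mm256_add_pd(a0, fa);  a1 = _mm256_add_pd(a1, fa);   /* ADD pipe */
        a2 = _mm256_add_pd(a2, fa);  a3 = _mm256_add_pd(a3, fa);
    }

    /* Combine and return the accumulators so the compiler cannot drop the loop. */
    __m256d sum = _mm256_add_pd(_mm256_add_pd(m0, m1), _mm256_add_pd(m2, m3));
    sum = _mm256_add_pd(sum, _mm256_add_pd(_mm256_add_pd(a0, a1), _mm256_add_pd(a2, a3)));
    double out[4] __attribute__((aligned(32)));
    _mm256_store_pd(out, sum);
    return out[0] + out[1] + out[2] + out[3];
}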
Perfwise
>>>You might also use "software prefetch" instructions>>>
For the posted code example, software prefetch will not be effective. I do not think that the hardware prefetcher will trigger either, unless that code causes two consecutive misses.
Hello GHui,
I'm moving this thread to the AVX forum (http://software.intel.com/en-us/forums/intel-avx-and-cpu-instructions). I hope the people on the AVX forum are more familiar with how to get maximum performance from AVX.
Pat
>>>Was the problem related to software prefetches?>>>
No, but software prefetches were mentioned in the responding post.
I mentioned software prefetch because, from my experience, you might want to use it in very optimized code. The posted code is a poor way to even try to sustain maximum MFLOPS: it uses neither FMA nor the ADD unit. Haswell can do 2 MUL, 2 FMA, 1 MUL + 1 FMA, 1 FMA + 1 ADD, or 1 MUL + 1 ADD per clock. You need a code example that keeps both pipes busy, which this example does not. My comments are more generic, about how you get the best performance through a CPU when considering load bandwidth, dispatch bandwidth (you can only dispatch 4 uops per clock), fill bandwidth into the L1D, L2 bandwidth, and TLB and memory latency. Some or all of these come into play in real code, depending on what you're trying to do.
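As an illustration of the FMA case (assuming a Haswell-class CPU and compilation with FMA support enabled, e.g. -xCORE-AVX2; the names, constants and iteration count below are placeholders, not code from this thread):

#include <immintrin.h>

/* Illustrative sketch: independent 256-bit FMA chains so both FMA pipes can
 * issue every cycle; more chains may be needed to fully cover the FMA latency. */
double test_pd_fma(void)
{
    __m256d c0 = _mm256_set1_pd(0.1), c1 = _mm256_set1_pd(0.2),
            c2 = _mm256_set1_pd(0.3), c3 = _mm256_set1_pd(0.4);
    const __m256d a = _mm256_set1_pd(1.0000001);
    const __m256d b = _mm256_set1_pd(0.0000001);

    for (long i = 0; i < 100000000L; i++)
    {
        c0 = _mm256_fmadd_pd(c0, a, b);   /* c = c*a + b, independent chains */
        c1 = _mm256_fmadd_pd(c1, a, b);
        c2 = _mm256_fmadd_pd(c2, a, b);
        c3 = _mm256_fmadd_pd(c3, a, b);
    }

    double out[4] __attribute__((aligned(32)));
    _mm256_store_pd(out, _mm256_add_pd(_mm256_add_pd(c0, c1), _mm256_add_pd(c2, c3)));
    return out[0] + out[1] + out[2] + out[3];   /* use the result so the loop survives optimization */
}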
Perfwise
>>>I mentioned software prefetch because, from my experience, you might want to use it in very optimized code.>>>
Yes, that is true. I also later realized that your comment was more like general advice.
If I am not wrong, in the posted code example at least two memory loads can be executed per CPU cycle, using Port 2 and Port 3. The next iteration's integer logic can execute at the same time, even out of order if there is a memory stall, and then the non-destructive AVX fmul is executed. I think that by the time the fmul executes, the memory store reference will already have been resolved. The longest latency is the fmul instruction.
Sergey,
>>Note: 8 CPUs were used
...
Tests are done on Dell Precision Mobile M4700 with Intel Core i7-3840QM ( Ivy Bridge / 4 cores / 8 logical CPUs / ark.intel.com/compare/70846 ).
<<
4 cores == 4 SSE/AVX floating point engines
HT == 8 contexts (sets of registers), with the two threads of each core sharing that core's single SSE/AVX floating point engine.
HT will see some advantage over 1 thread per core when a thread stalls for memory or last level cache fetch/store. My experience was about 15% improvement. Your 4.64x improvement / 4 = 1.16, or 16% improvement in your 1 core vs 4 core experiment.
Jim Dempsey
I should also mention that each HT sibling has a complete set of integer ALUs and its own instruction pipeline. A goodly portion of any program consists of integer (address) manipulation. While one HT thread is, say, fetching a pointer and computing an offset, its HT sibling has relatively free access to the FPU/SSE/AVX unit (the first thread may have pending FP operations too, in which case the advantage to the second thread is diminished). The 15% improvement I experienced is usually found when multiple threads are working on different slices of the same array, which usually also involves little integer math other than adding a stride to a register.
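A minimal sketch of the kind of loop meant here (placeholder names; compile with OpenMP enabled, e.g. /Qopenmp): each thread, including HT siblings, works on a different slice of the same array, and the integer work is little more than advancing the index.

/* Each OpenMP thread is given a static, contiguous slice of the array. */
void scale_array(double *a, long n, double s)
{
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++)
        a[i] *= s;
}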
Jim Dempsey
Some advantage can also be gained when the floating-point code has no data interdependencies. This is also relevant to port utilisation in single-threaded execution; the best example would be a series of independent fmul or fadd operations.
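As a sketch of that point (illustrative placeholder code, not from this thread): the first loop below serializes on the multiply latency because each result feeds the next multiply, while the second uses independent accumulators so the multiply port can issue every cycle.

#include <immintrin.h>

/* Dependent chain: each mul must wait for the previous result. */
__m256d dependent_chain(__m256d v, __m256d f, long n)
{
    for (long i = 0; i < n; i++)
        v = _mm256_mul_pd(v, f);
    return v;
}

/* Independent chains: five accumulators cover the ~5-cycle mul latency on SB/IB. */
__m256d independent_chains(__m256d v, __m256d f, long n)
{
    __m256d v0 = v, v1 = v, v2 = v, v3 = v, v4 = v;
    for (long i = 0; i < n; i += 5)
    {
        v0 = _mm256_mul_pd(v0, f);
        v1 = _mm256_mul_pd(v1, f);
        v2 = _mm256_mul_pd(v2, f);
        v3 = _mm256_mul_pd(v3, f);
        v4 = _mm256_mul_pd(v4, f);
    }
    return _mm256_mul_pd(_mm256_mul_pd(_mm256_mul_pd(v0, v1), _mm256_mul_pd(v2, v3)), v4);
}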
>>>HT will see some advantage over 1 thread per core when a thread stalls for memory or last level cache fetch/store. My experience was about 15% improvement. Your 4.64x improvement / 4 = 1.16, or 16% improvement in your 1 core vs 4 core experiment.>>>
There will also be some advantage when the two threads execute different sets of instructions at the same time.
Sergey Kostrov wrote:
>> Use OpenMP ( /Qopenmp ) to parallelize processing and set OpenMP environment variable OMP_NUM_THREADS to 8
>> You could also try to use Intel C++ compiler option /Qparallel
I use pthread affinity to bind each thread to a core, and every core reaches up to 2 GFlops.
>>What version of Intel C++ compiler and platform ( OS ) do you use?
I use ics2013.
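For reference, a minimal sketch of the kind of per-thread core binding mentioned above (Linux/glibc, using pthread_setaffinity_np; this is an illustration with placeholder names, not the actual test code):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Pin the calling thread to one core. */
static int bind_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *worker(void *arg)
{
    int core = *(int *)arg;
    if (bind_to_core(core) != 0)
        fprintf(stderr, "failed to bind to core %d\n", core);
    /* ... run the AVX kernel here and measure GFlops per core ... */
    return NULL;
}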
Sergey Kostrov wrote:
>>...HT will see some advantage over 1 thread per core when a thread stalls for memory or last level cache fetch/store.
>>My experience was about 15% improvement. Your 4.64x improvement / 4 = 1.16, or 16% improvement in your
>>1 core vs 4 core experiment.
I knew that processing would be done faster and it is good to see that our results are consistent. Thanks for the note, Jim.
If I enable HT, every core reaches 1.5 GFlops, while if I disable HT, every core reaches 2 GFlops.
perfwise wrote:
Haswell can do 2 MUL, 2 FMA, 1 MUL + 1 FMA, 1 FMA + 1 ADD, or 1 MUL + 1 ADD per clock. You need a code example that keeps both pipes busy, which this example does not.
Thanks, I will try this method.
My purpose with this code is to learn what kind of code can get maximum performance from AVX.
GHui wrote:
Quote:
Sergey Kostrov wrote:
>>...HT will see some advantage over 1 thread per core when a thread stalls for memory or last level cache fetch/store.
>>My experience was about 15% improvement. Your 4.64x improvement / 4 = 1.16, or 16% improvement in your
>>1 core vs 4 core experiment.
I knew that processing would be done faster and it is good to see that our results are consistent. Thanks for the note, Jim.
If I enable HT, every core reaches 1.5 GFlops, while if I disable HT, every core reaches 2 GFlops.
Do you want to use multithreading on that piece of code? You will not speed up its execution; the overhead of thread creation alone will be greater than the running time of that code.
iliyapolak wrote:
Quote:
Do you want to use multithreading on that piece of code? You will not speed up its execution; the overhead of thread creation alone will be greater than the running time of that code.
I am testing a system in order to know whether it can get the maximum performance from AVX.
