topic Some advantage can be gain in Intel® ISA Extensions

To use FPU

GHui — Fri, 28 Jun 2013 09:13:53 GMT

The following code is meant to exercise the FPU. I run it on an E5-2620, and it only reaches about 2 GFlops. If I want to reach 2*8 GFlops, how should I write the program?

Any help will be appreciated.

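#include <immintrin.h>

void* test_pd_avx()
{
    double x[4]={12.02,14.34,34.23,234.34};
    double y[4]={123.234,234.234,675.34,3453.345};
    /* unaligned loads: plain double arrays are not guaranteed to be 32-byte aligned */
    __m256d mx=_mm256_loadu_pd(x);
    __m256d my=_mm256_loadu_pd(y);
    for(;;)  /* runs forever: one 256-bit multiply (4 doubles) per iteration */
    {
        __m256d mz=_mm256_mul_pd(mx,my);
    }
}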

The Compiler Option: icc test.c -O0

SergeyKostrov — Mon, 01 Jul 2013 04:42:58 GMT

>>...The following code is to use FPU...

Here are a couple of notes:

- It doesn't use the x87 floating-point unit; it uses the SIMD ( SSE/AVX ) floating-point unit.
- Use OpenMP ( /Qopenmp ) to parallelize processing and set the OpenMP environment variable OMP_NUM_THREADS to 8:
  /Qopenmp - enable the compiler to generate multi-threaded code based on the OpenMP* directives ( same as /openmp )
- You could also try the Intel C++ compiler option /Qparallel:
  /Qparallel - enable the auto-parallelizer to generate multi-threaded code for loops that can be safely executed in parallel

>>...The Compiler Option: icc test.c -O0...

What version of the Intel C++ compiler and which platform ( OS ) do you use?
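Editor's sketch of the OpenMP suggestion, assuming a loop whose iterations are independent of each other. The function name mul_arrays and its signature are illustrative and not from the thread. If the compiler's OpenMP option is not enabled (for example /Qopenmp as quoted above, or -fopenmp with GCC), the pragma is simply ignored and the loop runs on one thread.

```c
#include <assert.h>

/* z[i] = x[i] * y[i]. Every iteration is independent, so the loop
   can be divided among threads by the OpenMP runtime. */
void mul_arrays(const double *x, const double *y, double *z, long n)
{
    long i;
    assert(x && y && z && n >= 0);
    #pragma omp parallel for
    for (i = 0; i < n; ++i) {
        z[i] = x[i] * y[i];
    }
}
```

The number of threads is then taken from OMP_NUM_THREADS at run time, as described in the notes above.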

perfwise — Mon, 01 Jul 2013 13:12:03 GMT

GHui,

If you really want to run at top speed, program in assembly rather than intrinsics, so you "really" know what's happening with register allocation under the covers. To achieve max utilization, you need to use both the "add" and the "mul" pipes in your code; I only see you using the "mul" unit. You need to use 256 bits as well, since the units are 256 bits wide.

Another point: while in some trivial and pointless code you might get full utilization while you're hitting in the L1D, you want to be able to get this performance even when you're not hitting in the L1D, when your data resides in the L2 or in memory. If you are only focused on the L1D hit case, what happens while that data is being brought into the L1D? You're likely not doing any computation during that copy. To avoid this, while you're doing computation, also:

1) block your data so as to arrange its locality within the caches, the L2 preferably;
2) try to reuse loads of data, if the algorithm allows it, so as to lessen the number of LDs per cycle you need to perform. This is important when you're filling data into the L1D while also loading it to the core; you likely can't fill and load 2 ops at the same time due to design constraints of the L1D;
3) make sure you don't have TLB misses. A TLB miss is wasted cycles, 7 on SB/IB and HW, where you are achieving nothing because you don't have the physical address available to fetch the cache line you need from the cache hierarchy or memory (monitor your TLB miss count per 1000 instructions; if it's an issue, use huge pages or change how you're doing the computation via loop reordering or blocking of data);
4) determine what the bandwidth constraints are upon your code to get max performance in the FPU: is it even possible, does the L1D have enough bandwidth?

You might also use "software prefetch" instructions, so as to pre-emptively fill data that "you" know you need. HW prefetchers are very useful, but they'll never be able to predict the future of memory accesses like the algorithm's creator can.

My DGEMM/SGEMM/etc, achieves 96% efficiency on SB/IB and soon HW and it only requires a GB or 2 of bandwidth from the system memory to do so. So.. I speak from experience. Focus on 1T, don't worry about parallelization.. solve the simple problem before you complicate it.

Perfwise
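Editor's sketch of points 1) and 2) above (blocking for cache locality and reusing loaded values), shown on a plain matrix multiply. This is not perfwise's DGEMM; the name matmul_blocked and the tile size BLOCK = 64 are assumptions chosen for illustration. Three 64x64 tiles of doubles take about 96 KB, which is meant to fit in a 256 KB L2; the best value depends on the machine.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK 64  /* assumed tile size, tune per cache size */

/* C += A * B for n x n row-major matrices, processed tile by tile
   so that each tile is reused while it is still in cache. */
void matmul_blocked(const double *a, const double *b, double *c, int n)
{
    assert(a && b && c && n > 0);
    for (int ii = 0; ii < n; ii += BLOCK) {
        int i_end = (ii + BLOCK < n) ? ii + BLOCK : n;
        for (int kk = 0; kk < n; kk += BLOCK) {
            int k_end = (kk + BLOCK < n) ? kk + BLOCK : n;
            for (int jj = 0; jj < n; jj += BLOCK) {
                int j_end = (jj + BLOCK < n) ? jj + BLOCK : n;
                for (int i = ii; i < i_end; ++i) {
                    for (int k = kk; k < k_end; ++k) {
                        /* loaded once, reused for the whole j loop */
                        double aik = a[(size_t)i * n + k];
                        for (int j = jj; j < j_end; ++j) {
                            c[(size_t)i * n + j] += aik * b[(size_t)k * n + j];
                        }
                    }
                }
            }
        }
    }
}
```

The caller is expected to zero C first if a plain product is wanted, since the routine accumulates into it.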

Bernard — Mon, 01 Jul 2013 16:14:27 GMT

>>>You might also use "software prefetch" instructions>>>

For the posted code example, software prefetch will not be effective. I do not think that hardware prefetch will trigger unless that code causes 2 consecutive misses.

SergeyKostrov — Mon, 01 Jul 2013 16:20:09 GMT

>>...For posted code example software prefetch will not be effective..

Was the problem related to software prefetches?

Patrick_F_Intel1 — Mon, 01 Jul 2013 16:29:33 GMT

Hello GHui,

I'm moving this thread to AVX forum (http://software.intel.com/en-us/forums/intel-avx-and-cpu-instructions). I hope the AVX forum is more familiar with how to get max performance from AVX.

Pat

Bernard — Mon, 01 Jul 2013 16:52:32 GMT

>>>Was the problem related to software prefetches?>>>

No, but software prefetches were mentioned in the post I was responding to.

perfwise — Mon, 01 Jul 2013 19:07:45 GMT

I mentioned sw pref.. because for "very optimized" code you might want to use it from my experience. The posted code is a poor basis for even trying to sustain max MFLOPS: you are not using FMA or the ADD unit. HW can do 2 MUL, 2 FMA, 1 MUL + 1 FMA, 1 FMA + 1 ADD or 1 MUL + 1 ADD per clk. You need a code example which is keeping both pipes busy.. which this example is not. My comments are more generic: they are about how you get the best performance through a CPU when considering LD bandwidth, dispatch bandwidth (you can only dispatch 4 uops per clk), fill bandwidth into the L1D, L2 bandwidth, and the latency of the TLB and memory. Some or all of these come into play in real code, depending upon what you're trying to do.

Perfwise
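Editor's sketch of the arithmetic behind the "2*8 GFlops" target: peak rate = clock x SIMD lanes x floating-point instructions issued per clock x flops per instruction. The function name peak_gflops is illustrative. The 2.0 GHz figure is an assumption (the nominal clock of the E5-2620, ignoring turbo); the "1 MUL + 1 ADD per clk" figure comes from the post above, and the E5-2620 has no FMA.

```c
#include <assert.h>

/* Theoretical single-core peak in GFLOP/s.
   ghz           : core clock in GHz (assumed, not measured)
   lanes         : doubles per 256-bit register (4)
   ops_per_clk   : vector FP instructions issued per clock
   flops_per_op  : 1 for MUL or ADD, 2 for FMA */
double peak_gflops(double ghz, int lanes, int ops_per_clk, int flops_per_op)
{
    assert(ghz > 0.0 && lanes > 0 && ops_per_clk > 0 && flops_per_op > 0);
    return ghz * lanes * ops_per_clk * flops_per_op;
}
```

Under these assumptions peak_gflops(2.0, 4, 2, 1) gives 16 (MUL and ADD pipes both busy), while peak_gflops(2.0, 4, 1, 1) gives 8, which is the ceiling for a multiply-only loop like the one posted.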

SergeyKostrov — Tue, 02 Jul 2013 00:06:27 GMT

>>...If I want to 2*8 GFlops, how could I code program?..

GHui,

Here are the results of my very quick verification of how /Qparallel works. If your loop is a simple one ( or a couple of loops ) you could easily accelerate computations without much effort.

[ Test 1 - /Qparallel is Not used ]

icl.exe /O3 /MD /Qstd=c++0x /Qrestrict /Qansi-alias matmul.cpp

Test Started
Matrix multiplication C[2048x2048] = A[2048x2048] * B[2048x2048]
Intializing matrix data
Measuring performance
Matrix multiplication completed in 16.39149 seconds
Deallocating memory
Test Completed

Note: 1 CPU was used

[ Test 2 - /Qparallel is used ]

icl.exe /O3 /MD /Qstd=c++0x /Qrestrict /Qansi-alias /Qparallel matmul.cpp

Test Started
Matrix multiplication C[2048x2048] = A[2048x2048] * B[2048x2048]
Intializing matrix data
Measuring performance
Matrix multiplication completed in 3.53025 seconds
Deallocating memory
Test Completed

Note: 8 CPUs were used

Summary: Take into account that I have Not changed anything in the C/C++ code in order to increase performance. Even though performance increased by ~4.64x, you could apply many other tricks to improve that number.

Tests are done on a Dell Precision Mobile M4700 with Intel Core i7-3840QM ( Ivy Bridge / 4 cores / 8 logical CPUs / ark.intel.com/compare/70846 ).

Bernard — Tue, 02 Jul 2013 05:49:19 GMT

>>>I mentioned sw pref.. because for "very optimized" code you might want to use it from my experience.>>>

Yes, that is true. I also realized later that your comment was meant more as general advice.

If I am not wrong, in the posted code example (per loop iteration) at least two memory loads can be executed per CPU cycle using Port 2 and Port 3. The integer logic of the next loop iteration will be executed at the same time, even out of order if there is a memory stall, and the non-destructive AVX fmul instruction will be executed. I think that during the fmul execution the memory store reference will already be resolved. The longest latency is that of the fmul instruction.

jimdempseyatthecove — Thu, 04 Jul 2013 17:27:28 GMT

Sergey,

>>Note: 8 CPUs were used
...
Tests are done on Dell Precision Mobile M4700 with Intel Core i7-3840QM ( Ivy Bridge / 4 cores / 8 logical CPUs / ark.intel.com/compare/70846 ).
<<

4 cores == 4 SSE/AVX floating point engines
HT == 8 contexts (set of registers) with two threads of each core sharing the one SSE/AVX floating point engine of the core.

HT will see some advantage over 1 thread per core when a thread stalls for memory or last level cache fetch/store. My experience was about 15% improvement. Your 4.64x improvement / 4 = 1.16, or 16% improvement in your 1 core vs 4 core experiment.

Jim Dempsey
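Editor's sketch of the arithmetic in the post above, using the two timings Sergey reported (16.39149 s on 1 CPU, 3.53025 s with /Qparallel). The function names are illustrative. Attributing the extra scaling beyond 4x to HT is Jim's interpretation, not something this calculation can prove.

```c
#include <assert.h>

/* Overall speedup of the parallel run relative to the serial run. */
double speedup(double t_serial, double t_parallel)
{
    assert(t_serial > 0.0 && t_parallel > 0.0);
    return t_serial / t_parallel;
}

/* Speedup divided by the number of physical cores; a value above 1.0
   means more than linear scaling per core. */
double per_core_scaling(double total_speedup, int cores)
{
    assert(cores > 0);
    return total_speedup / cores;
}
```

With the reported timings, speedup(16.39149, 3.53025) is about 4.64 and per_core_scaling(4.64, 4) is about 1.16, which is the 16% figure quoted.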

SergeyKostrov — Thu, 04 Jul 2013 22:59:00 GMT

>>...HT will see some advantage over 1 thread per core when a thread stalls for memory or last level cache fetch/store.
>>My experience was about 15% improvement. Your 4.64x improvement / 4 = 1.16, or 16% improvement in your
>>1 core vs 4 core experiment.

I knew that processing would be done faster and it is good to see that our results are consistent. Thanks for the note, Jim.

jimdempseyatthecove — Sat, 06 Jul 2013 12:53:43 GMT

I should also mention that each HT has a complete set of integer ALUs and instruction pipeline. A goodly portion of any program includes integer (address) manipulation. While one HT is, say, fetching a pointer and computing an offset, the other HT sibling has relatively free access to the FPU/SSE/AVX unit (the first thread may have pending operations too, in which case the advantage to the second thread is diminished). The 15% improvement I experienced is usually found when multiple threads are working on different slices of the same array, which usually also happens to have little integer math other than adding a stride to a register.

Jim Dempsey

Bernard — Mon, 08 Jul 2013 17:38:39 GMT

Some advantage can also be gained when floating-point code has no data interdependencies. This is also relevant to execution port utilisation in single-threaded execution; the best example would be a series of independent fmul or fadd operations.
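Editor's sketch of a series of independent multiplies and adds, written with AVX intrinsics. The function name mul_add_chains, the macro AVX_TARGET, and the choice of four chains per pipe are assumptions for illustration; the number of chains actually needed depends on the instruction latencies of the CPU. The scale and step values are passed as parameters so the compiler cannot fold the arithmetic away. It should be built with optimization enabled, because at -O0 compilers generally keep locals in memory and the loads and stores then dominate.

```c
#include <assert.h>
#include <immintrin.h>

/* GCC/Clang: allow AVX intrinsics in this function even without -mavx. */
#if defined(__GNUC__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

/* Each iteration issues 4 independent multiplies and 4 independent adds
   (8 vector ops x 4 doubles = 32 flops). No chain depends on another,
   so the MUL and ADD pipes can both be kept busy.
   Returns the sum of all lanes so the work cannot be discarded. */
AVX_TARGET double mul_add_chains(long iters, double scale, double step)
{
    assert(iters >= 0);
    const __m256d vs = _mm256_set1_pd(scale);
    const __m256d vt = _mm256_set1_pd(step);
    __m256d m0 = _mm256_set1_pd(1.0), m1 = m0, m2 = m0, m3 = m0;
    __m256d a0 = _mm256_setzero_pd(), a1 = a0, a2 = a0, a3 = a0;

    for (long i = 0; i < iters; ++i) {
        m0 = _mm256_mul_pd(m0, vs);  a0 = _mm256_add_pd(a0, vt);
        m1 = _mm256_mul_pd(m1, vs);  a1 = _mm256_add_pd(a1, vt);
        m2 = _mm256_mul_pd(m2, vs);  a2 = _mm256_add_pd(a2, vt);
        m3 = _mm256_mul_pd(m3, vs);  a3 = _mm256_add_pd(a3, vt);
    }

    /* Combine all accumulators, then sum the four lanes. */
    __m256d msum = _mm256_add_pd(_mm256_add_pd(m0, m1), _mm256_add_pd(m2, m3));
    __m256d asum = _mm256_add_pd(_mm256_add_pd(a0, a1), _mm256_add_pd(a2, a3));
    double out[4];
    _mm256_storeu_pd(out, _mm256_add_pd(msum, asum));
    return out[0] + out[1] + out[2] + out[3];
}
```

To estimate a FLOP rate, time a call with a large iters and divide 32 * iters by the elapsed seconds. A scale far from 1.0 will overflow or underflow over long runs, which can distort the timing.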

Bernard — Mon, 08 Jul 2013 17:49:48 GMT

>>>HT will see some advantage over 1 thread per core when a thread stalls for memory or last level cache fetch/store. My experience was about 15% improvement. Your 4.64x improvement / 4 = 1.16, or 16% improvement in your 1 core vs 4 core experiment.>>>

There will also be some advantage when the two threads execute different sets of instructions at the same time.

GHui — Wed, 10 Jul 2013 16:15:02 GMT

Sergey Kostrov wrote:

>> Use OpenMP ( /Qopenmp ) to parallelize processing and set OpenMP environment variable OMP_NUM_THREADS to 8
>> You could also try to use Intel C++ compiler option /Qparallel:

I use pthread affinity to bind each thread to a core, and every core reaches up to 2 GFlops.

>>What version of Intel C++ compiler and platform ( OS ) do you use?

I use ics2013.

GHui — Wed, 10 Jul 2013 16:19:58 GMT

Sergey Kostrov wrote:

>>...HT will see some advantage over 1 thread per core when a thread stalls for memory or last level cache fetch/store.
>>My experience was about 15% improvement. Your 4.64x improvement / 4 = 1.16, or 16% improvement in your
>>1 core vs 4 core experiment.

I knew that processing would be done faster and it is good to see that our results are consistent. Thanks for the note, Jim.

If I enable HT, every core reaches 1.5 GFlops, while if I disable HT, every core reaches 2 GFlops.

GHui — Wed, 10 Jul 2013 16:34:01 GMT

perfwise wrote:

HW can do 2 MUL, 2 FMA, 1 MUL + 1 FMA, 1 FMA + 1 ADD or 1 MUL + 1 ADD per clk. You need a code example which is keeping both pipes busy.. which this example is not.

Thanks, I will try this method.

GHui — Wed, 10 Jul 2013 16:45:21 GMT

My purpose for this code is to know what code can get max performance from AVX.

Bernard — Wed, 10 Jul 2013 16:57:14 GMT

GHui wrote:

Quote:

Sergey Kostrov wrote:
>>...HT will see some advantage over 1 thread per core when a thread stalls for memory or last level cache fetch/store.
>>My experience was about 15% improvement. Your 4.64x improvement / 4 = 1.16, or 16% improvement in your
>>1 core vs 4 core experiment.

I knew that processing would be done faster and it is good to see that our results are consistent. Thanks for the note, Jim.

If I enable HT, every core reaches 1.5 GFlops, while if I disable HT, every core reaches 2 GFlops.

Do you want to use multithreading on that piece of code? You will not speed up the execution of your code, and the overhead of thread creation alone will be greater than the running time of that code.