Intel® ISA Extensions

AVX Optimizations and Performance: VisualStudio vs GCC

James_S_7
Beginner
4,263 Views

Greetings,

   I have recently written some code using AVX function calls to perform a convolution in my software. I have compiled and run this code on two platforms with the following compilation settings of note:

1. Windows 7 w/ Visual Studio 2010 on an i7-2760QM

   Optimization: Maximize Speed (/O2)

   Inline Function Expansion: Only __inline(/Ob1)

   Enable Intrinsic Functions: No

   Favor Size or Speed: Favor fast code (/Ot)

2. Fedora Linux 15 w/ gcc 4.6 on an i7-3612QE

   Flags: -O3 -mavx -m64 -march=corei7-avx -mtune=corei7-avx

For my testing I ran the C implementation and the AVX implementation on both platforms and got the following timing results:

In Visual Studio:

C Implementation: 30ms

AVX Implementation: 5ms

In GCC:

C Implementation: 9ms

AVX Implementation: 57ms

As you can tell, my AVX numbers on Linux are very large by comparison. My concern, and the reason for this post, is that I may not have a proper understanding of AVX and of the settings needed to use it properly in both scenarios. For example, take my Visual Studio run: if I change Enable Intrinsic Functions to Yes, my AVX time goes from 5ms to 59ms. Does that mean that preventing the compiler from optimizing with intrinsics, and setting them manually, gives that much better results in Visual Studio? Last I checked there is nothing similar in gcc. Could Microsoft's compiler really be that much more capable than gcc in this case? Any ideas why my AVX numbers under gcc are so much larger? Any help is most appreciated. Cheers.
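For reference, the convolution source in question is not shown here; a hand-written AVX 1-D convolution typically looks something like the sketch below (all function names are hypothetical, and the scalar version is included only so the two can be checked against each other):

```c
#include <immintrin.h>  /* AVX intrinsics */
#include <stddef.h>

/* Scalar reference: out[i] = sum over k of in[i + k] * kernel[k]. */
void conv1d_scalar(const float *in, const float *kernel, float *out,
                   size_t n_out, size_t klen)
{
    for (size_t i = 0; i < n_out; ++i) {
        float acc = 0.0f;
        for (size_t k = 0; k < klen; ++k)
            acc += in[i + k] * kernel[k];
        out[i] = acc;
    }
}

/* AVX version: computes 8 output samples per iteration by broadcasting
   each kernel tap. Assumes n_out is a multiple of 8 and that in[] has
   at least n_out + klen - 1 readable elements. */
__attribute__((target("avx")))  /* lets this file build without -mavx */
void conv1d_avx(const float *in, const float *kernel, float *out,
                size_t n_out, size_t klen)
{
    for (size_t i = 0; i < n_out; i += 8) {
        __m256 acc = _mm256_setzero_ps();
        for (size_t k = 0; k < klen; ++k) {
            __m256 x = _mm256_loadu_ps(in + i + k);  /* unaligned load */
            __m256 c = _mm256_set1_ps(kernel[k]);    /* broadcast tap  */
            acc = _mm256_add_ps(acc, _mm256_mul_ps(x, c));
        }
        _mm256_storeu_ps(out + i, acc);
    }
}
```

Whether gcc keeps such a loop fast depends heavily on the flags and on alignment, which is exactly what the replies below dig into.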

1 Solution
42 Replies
SergeyKostrov
Valued Contributor II
4,254 Views
>>...2. Fedora Linux 15 w/ gcc 4.6 on a i7-3612QE

I recommend upgrading GCC to version 4.8.1 Release 4.

AVX Performance Tests

[ Microsoft C++ compiler VS 2010 ( AVX ) ]
...
Matrix Size: 8192 x 8192
Processing... ( Add - 1D-based )
_TMatrixSetF::Add - Pass 01 - Completed: 62.50000 ticks
_TMatrixSetF::Add - Pass 02 - Completed: 58.50000 ticks
_TMatrixSetF::Add - Pass 03 - Completed: 62.25000 ticks
_TMatrixSetF::Add - Pass 04 - Completed: 62.50000 ticks
_TMatrixSetF::Add - Pass 05 - Completed: 58.50000 ticks
Add - 1D-based - Passed
Processing... ( Sub - 1D-based )
_TMatrixSetF::Sub - Pass 01 - Completed: 62.50000 ticks
_TMatrixSetF::Sub - Pass 02 - Completed: 66.25000 ticks
_TMatrixSetF::Sub - Pass 03 - Completed: 62.25000 ticks
_TMatrixSetF::Sub - Pass 04 - Completed: 62.50000 ticks
_TMatrixSetF::Sub - Pass 05 - Completed: 62.50000 ticks
Sub - 1D-based - Passed
...
SergeyKostrov
Valued Contributor II
1,018 Views
[ MinGW C++ compiler version 4.8.1 Release 4 ( AVX ) ]
...
Matrix Size: 8192 x 8192
Processing... ( Add - 1D-based )
_TMatrixSetF::Add - Pass 01 - Completed: 50.50000 ticks
_TMatrixSetF::Add - Pass 02 - Completed: 50.75000 ticks
_TMatrixSetF::Add - Pass 03 - Completed: 50.75000 ticks
_TMatrixSetF::Add - Pass 04 - Completed: 46.75000 ticks
_TMatrixSetF::Add - Pass 05 - Completed: 50.75000 ticks
Add - 1D-based - Passed
Processing... ( Sub - 1D-based )
_TMatrixSetF::Sub - Pass 01 - Completed: 50.75000 ticks
_TMatrixSetF::Sub - Pass 02 - Completed: 50.75000 ticks
_TMatrixSetF::Sub - Pass 03 - Completed: 50.50000 ticks
_TMatrixSetF::Sub - Pass 04 - Completed: 50.75000 ticks
_TMatrixSetF::Sub - Pass 05 - Completed: 50.75000 ticks
Sub - 1D-based - Passed
...
SergeyKostrov
Valued Contributor II
1,018 Views
[ Intel C++ Compiler XE 13.1.0.149 ( AVX ) ]
...
Matrix Size: 8192 x 8192
Processing... ( Add - 1D-based )
_TMatrixSetF::Add - Pass 01 - Completed: 50.75000 ticks
_TMatrixSetF::Add - Pass 02 - Completed: 54.50000 ticks
_TMatrixSetF::Add - Pass 03 - Completed: 50.75000 ticks
_TMatrixSetF::Add - Pass 04 - Completed: 50.75000 ticks
_TMatrixSetF::Add - Pass 05 - Completed: 50.50000 ticks
Add - 1D-based - Passed
Processing... ( Sub - 1D-based )
_TMatrixSetF::Sub - Pass 01 - Completed: 50.75000 ticks
_TMatrixSetF::Sub - Pass 02 - Completed: 46.75000 ticks
_TMatrixSetF::Sub - Pass 03 - Completed: 50.75000 ticks
_TMatrixSetF::Sub - Pass 04 - Completed: 50.75000 ticks
_TMatrixSetF::Sub - Pass 05 - Completed: 50.75000 ticks
Sub - 1D-based - Passed
...

Note: Take into account that the quality of code generation in the latest version of GCC, especially for legacy instruction sets like SSE2 and SSE4, and of course for AVX, has improved compared to version 3.4.2. In several of my test cases it already outperforms the Intel C++ compiler.
TimP
Honored Contributor III
1,018 Views

OK, situations where I see gcc out-performing icc:

Unrolled source requiring re-roll optimization, where the compiler replaces the source-code unrolling with its own. icc dropped re-roll support back around version 10.0. We can argue that source-code unrolling is an undesirable practice, given that modern software and hardware techniques eliminate the need for it in trivial cases. Unroll-and-jam is another story.

Variable stride (even when written with CEAN so as to get AVX-128 in icc) e.g.

      for (i__ = *n1; i__ <= i__2; i__ += i__3)
          a[i__] += b[*n - (k += j) + 1];

Some cases where intrinsics are used to dictate code generation, or where vectorization isn't possible, are where the superior automatic unrolling facilities of gcc come in, if you are willing to use them (and to adjust the unrolling limit to your CPU), e.g.

CFLAGS = -O3 -std=c99 -funroll-loops --param max-unroll-times=2 -ffast-math -fno-cx-limited-range -march=corei7-avx -fopenmp -g3 -gdwarf-2

Those debug options are recommended for using Windows gcc with Intel(r) VTune(tm), in case you missed the hint.

gcc -ffast-math -fno-cx-limited-range is roughly equivalent to icl -fp:fast=1, the latter being a default.

If you don't study the options, you won't get the best out of gcc.

Intel corei7-4 CPUs want less software unrolling than their predecessors, while gcc's aggressive unlimited unrolling was better suited to the Harpertown generation.  I'm not getting consistently satisfactory results with avx2 with either compiler; avx2 seems to expose bugs in gcc with openmp, while icc drops some i7-2 optimizations which remain better on i7-4.  I'll wait to optimize for corei7-5 when it arrives, if my retirement permits.

It's not usually difficult to discover and correct situations where icc doesn't match gcc performance, while there are many situations where it's easier to get full performance with icc.

Bernard
Valued Contributor I
1,018 Views

Aggressively unrolling by more than two, besides increasing register pressure, will still only utilize two execution ports (depending on the instructions) in parallel per cycle. When that is coupled with prefetching, the outstanding decoded instructions (micro-ops) corresponding to the prefetched data can be pipelined through the SIMD execution stack. Here the ICache can also speed up execution by caching frequently used decoded instructions.
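To illustrate the port-utilization point with a minimal sketch (hypothetical functions, not anyone's actual code): unrolling by two with independent accumulators breaks the serial dependence chain, so the two additions of an iteration can issue on separate execution ports in the same cycle.

```c
#include <stddef.h>

/* Serial accumulation: each add depends on the previous result. */
float sum1(const float *a, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

/* Unrolled by two with independent accumulators: the two adds per
   iteration do not depend on each other, so they can execute in
   parallel on separate ports. Assumes n is even. */
float sum2(const float *a, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f;
    for (size_t i = 0; i < n; i += 2) {
        s0 += a[i];
        s1 += a[i + 1];
    }
    return s0 + s1;
}
```

Note that the two-accumulator version reassociates the floating-point sum, so results can differ in the last bits for general inputs.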

Bernard
Valued Contributor I
1,018 Views

@Sergey

 Thanks for posting the results of compiler comparison.

James_S_7
Beginner
1,018 Views

Sergey,

   I recompiled my software on a machine that had gcc 4.8.2 and even updated my compiler flags to reflect the following:

-O3 -march=core-avx-i -mtune=core-avx-i

I am, however, getting on average the exact same timing numbers as before... which to me is very odd. I can't help but think I am missing something trivial...

Thanks again for your help in this matter.

Bernard
Valued Contributor I
1,018 Views

Can you post a full disassembly of the GCC-generated code? Moreover, I would advise you to profile the FIR code with the help of VTune.

SergeyKostrov
Valued Contributor II
1,018 Views
>>-O3 -march=core-avx-i -mtune=core-avx-i
>>
>>I am however getting on the average the exact same timing numbers as before...which to me is very odd. I can't help
>>but to think I am missing something trivial...

Your set of options is very simple and, I would say, basic. So you need to use more GCC compiler options; please review as many as possible ( try to tune up your application, which is a very time-consuming procedure ). If you use lots of for-loops in your code, take a look at how the __builtin_assume_aligned built-in function needs to be used. There is some "magic" related to that function and it really speeds up processing. Take into account that in almost all cases when memory is allocated dynamically I use the _mm_malloc and _mm_free intrinsic functions ( however, there are some exceptions... ). I'll post more performance results later. This week I've spent a significant amount of time on combining auto-vectorization and manual software pipelining. Results are positive and there is a ~1.5 percent improvement in performance.
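As an illustration of the two hints above, a minimal sketch (the function name sum_aligned is made up for this example): __builtin_assume_aligned passes the alignment promise to GCC's vectorizer, and _mm_malloc provides an allocation that actually keeps that promise.

```c
#include <immintrin.h>  /* _mm_malloc / _mm_free */
#include <stddef.h>

/* The alignment promise lets GCC emit aligned vector loads in the
   vectorized loop instead of the more defensive unaligned ones.
   The promise must be true: pair it with a 32-byte-aligned buffer. */
float sum_aligned(const float *p, size_t n)
{
    const float *a = (const float *)__builtin_assume_aligned(p, 32);
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}
```

A matching allocation would be `float *buf = (float *)_mm_malloc(n * sizeof(float), 32);` released with `_mm_free(buf);`.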
SergeyKostrov
Valued Contributor II
1,018 Views
>>...I can't help but to think I am missing something trivial...

James, I will follow up on that and explain in a generic way how I tune up algorithms. Thanks for the update related to GCC version 4.8.2; I'll update as well.
SergeyKostrov
Valued Contributor II
1,018 Views
>>>>...I can't help but to think I am missing something trivial...
>>
>>James, I will follow up on that and I will explain in a generic way how I tune up algorithms.

Please consider fine-tuning your convolution algorithm. I use 5 different C++ compilers, and in most cases the core parts of some algorithms in the project I work on are fine-tuned for each C++ compiler. Here is an example ( performance results for a classic matrix multiplication algorithm ):

...
#if ( defined ( _WIN32_MGW ) )
#define MatrixMulProcessingCTv1 MatrixMulProcessingCTUnRvA1
#define MatrixMulProcessingCv1 MatrixMulProcessingCUnRvA1
// #define MatrixMulProcessingCTv1 MatrixMulProcessingCTvB1
// #define MatrixMulProcessingCv1 MatrixMulProcessingCvB1
// #define MatrixMulProcessingCTv1 MatrixMulProcessingCTvC1
// #define MatrixMulProcessingCv1 MatrixMulProcessingCvC1
// #define MatrixMulProcessingCTv1 MatrixMulProcessingCTvD1
// #define MatrixMulProcessingCv1 MatrixMulProcessingCvD1
// #define MatrixMulProcessingCTv1 MatrixMulProcessingCTvE1 // ***
// #define MatrixMulProcessingCv1 MatrixMulProcessingCvE1 // ***
#endif
...

[ Performance Results ]

Matrix Size : 1024 x 1024
Matrix Size Threshold: N/A
Matrix Partitions : N/A
ResultSets Reflection: N/A
Calculating...
...

Test-Case 1 - Version MatrixMulProcessingCTUnRvA1 used
...
Classic A - Pass 01 - Completed: 3.17200 secs
Classic A - Pass 02 - Completed: 3.17200 secs
Classic A - Pass 03 - Completed: 3.17200 secs
Classic A - Pass 04 - Completed: 3.17200 secs
Classic A - Pass 05 - Completed: 3.17100 secs
...
Note: Worst Performance

Test-Case 2 - Version MatrixMulProcessingCTvB1 used
...
Classic A - Pass 01 - Completed: 2.73500 secs
Classic A - Pass 02 - Completed: 2.73400 secs
Classic A - Pass 03 - Completed: 2.73400 secs
Classic A - Pass 04 - Completed: 2.73400 secs
Classic A - Pass 05 - Completed: 2.73500 secs
...

Test-Case 3 - Version MatrixMulProcessingCTvC1 used
...
Classic A - Pass 01 - Completed: 2.73500 secs
Classic A - Pass 02 - Completed: 2.73400 secs
Classic A - Pass 03 - Completed: 2.73400 secs
Classic A - Pass 04 - Completed: 2.73400 secs
Classic A - Pass 05 - Completed: 2.73500 secs
...

Test-Case 4 - Version MatrixMulProcessingCTvD1 used
...
Classic A - Pass 01 - Completed: 2.71900 secs
Classic A - Pass 02 - Completed: 2.71900 secs
Classic A - Pass 03 - Completed: 2.71800 secs
Classic A - Pass 04 - Completed: 2.71900 secs
Classic A - Pass 05 - Completed: 2.70300 secs
...
Note: Best Performance

Test-Case 5 - Version MatrixMulProcessingCTvE1 used
...
Classic A - Pass 01 - Completed: 2.71800 secs
Classic A - Pass 02 - Completed: 2.71900 secs
Classic A - Pass 03 - Completed: 2.71900 secs
Classic A - Pass 04 - Completed: 2.71900 secs
Classic A - Pass 05 - Completed: 2.71800 secs
...
Note: Best Performance
SergeyKostrov
Valued Contributor II
1,018 Views
Just for comparison, here are results for the Microsoft C++ compiler:

Test-Case 1 - Version MatrixMulProcessingCTUnRvA1 used
...
Classic A - Pass 01 - Completed: 3.34400 secs
Classic A - Pass 02 - Completed: 3.32800 secs
Classic A - Pass 03 - Completed: 3.32800 secs
Classic A - Pass 04 - Completed: 3.32800 secs
Classic A - Pass 05 - Completed: 3.31300 secs
...
Note: Best Performance ( however, it is slower by ~18 percent compared to MinGW )
SergeyKostrov
Valued Contributor II
1,018 Views
Here is a set of follow-ups...
SergeyKostrov
Valued Contributor II
1,018 Views
Performance Tests

[ MinGW C++ compiler version 4.8.1 Release 4 ]
...
Matrix Size: 8192 x 8192
Processing... ( Add - 1D-based )
_TMatrixSetF::Add - Pass 01 - Completed: 511.50000 ticks
_TMatrixSetF::Add - Pass 02 - Completed: 511.75000 ticks
_TMatrixSetF::Add - Pass 03 - Completed: 511.75000 ticks
_TMatrixSetF::Add - Pass 04 - Completed: 511.75000 ticks
_TMatrixSetF::Add - Pass 05 - Completed: 507.75000 ticks
Add - 1D-based - Passed
Processing... ( Sub - 1D-based )
_TMatrixSetF::Sub - Pass 01 - Completed: 511.75000 ticks
_TMatrixSetF::Sub - Pass 02 - Completed: 511.75000 ticks
_TMatrixSetF::Sub - Pass 03 - Completed: 511.75000 ticks
_TMatrixSetF::Sub - Pass 04 - Completed: 511.75000 ticks
_TMatrixSetF::Sub - Pass 05 - Completed: 511.75000 ticks
Sub - 1D-based - Passed
...

[ Intel C++ compiler XE 12.1.7.371 ]
...
Matrix Size: 8192 x 8192
Processing... ( Add - 1D-based )
_TMatrixSetF::Add - Pass 01 - Completed: 519.50000 ticks
_TMatrixSetF::Add - Pass 02 - Completed: 519.50000 ticks
_TMatrixSetF::Add - Pass 03 - Completed: 519.75000 ticks
_TMatrixSetF::Add - Pass 04 - Completed: 519.50000 ticks
_TMatrixSetF::Add - Pass 05 - Completed: 519.50000 ticks
Add - 1D-based - Passed
Processing... ( Sub - 1D-based )
_TMatrixSetF::Sub - Pass 01 - Completed: 519.50000 ticks
_TMatrixSetF::Sub - Pass 02 - Completed: 519.50000 ticks
_TMatrixSetF::Sub - Pass 03 - Completed: 519.50000 ticks
_TMatrixSetF::Sub - Pass 04 - Completed: 519.50000 ticks
_TMatrixSetF::Sub - Pass 05 - Completed: 519.50000 ticks
Sub - 1D-based - Passed
...

[ Microsoft C++ compiler VS 2005 ]
...
Matrix Size: 8192 x 8192
Processing... ( Add - 1D-based )
_TMatrixSetF::Add - Pass 01 - Completed: 562.50000 ticks
_TMatrixSetF::Add - Pass 02 - Completed: 562.50000 ticks
_TMatrixSetF::Add - Pass 03 - Completed: 558.50000 ticks
_TMatrixSetF::Add - Pass 04 - Completed: 558.75000 ticks
_TMatrixSetF::Add - Pass 05 - Completed: 558.50000 ticks
Add - 1D-based - Passed
Processing... ( Sub - 1D-based )
_TMatrixSetF::Sub - Pass 01 - Completed: 558.50000 ticks
_TMatrixSetF::Sub - Pass 02 - Completed: 558.75000 ticks
_TMatrixSetF::Sub - Pass 03 - Completed: 558.50000 ticks
_TMatrixSetF::Sub - Pass 04 - Completed: 558.50000 ticks
_TMatrixSetF::Sub - Pass 05 - Completed: 558.75000 ticks
Sub - 1D-based - Passed
...

Note: The MinGW C++ compiler outperforms the Intel C++ compiler by ~2.3 percent and the Microsoft C++ compiler by ~9 percent.
SergeyKostrov
Valued Contributor II
1,018 Views
Here is an example where the MinGW C++ compiler outperforms the Microsoft C++ compiler:

[ MinGW C++ compiler version 4.8.1 Release 4 ]
...
Strassen HBI
Matrix Size : 2048 x 2048
Matrix Size Threshold: 1024 x 1024
Matrix Partitions : 1
ResultSets Reflection: N/A
Calculating...
Strassen HBI - Pass 01 - Completed: 20.62500 secs
Strassen HBI - Pass 02 - Completed: 20.45300 secs
Strassen HBI - Pass 03 - Completed: 20.25000 secs
Strassen HBI - Pass 04 - Completed: 20.25000 secs
Strassen HBI - Pass 05 - Completed: 20.25000 secs
ALGORITHM_STRASSENHBI - Passed

Strassen HBC
Matrix Size : 2048 x 2048
Matrix Size Threshold: 256 x 256
Matrix Partitions : 400
ResultSets Reflection: Enabled
Calculating...
Strassen HBC - Pass 01 - Completed: 235.23501 secs
Strassen HBC - Pass 02 - Completed: 20.43800 secs
Strassen HBC - Pass 03 - Completed: 20.35900 secs
Strassen HBC - Pass 04 - Completed: 20.35900 secs
Strassen HBC - Pass 05 - Completed: 20.45300 secs
ALGORITHM_STRASSENHBC - 1 - Passed
...

[ Microsoft C++ compiler VS 2008 ]
...
Strassen HBI
Matrix Size : 2048 x 2048
Matrix Size Threshold: 1024 x 1024
Matrix Partitions : 1
ResultSets Reflection: N/A
Calculating...
Strassen HBI - Pass 01 - Completed: 22.04600 secs
Strassen HBI - Pass 02 - Completed: 21.96900 secs
Strassen HBI - Pass 03 - Completed: 21.98500 secs
Strassen HBI - Pass 04 - Completed: 22.31200 secs
Strassen HBI - Pass 05 - Completed: 22.09400 secs
ALGORITHM_STRASSENHBI - Passed

Strassen HBC
Matrix Size : 2048 x 2048
Matrix Size Threshold: 256 x 256
Matrix Partitions : 400
ResultSets Reflection: Enabled
Calculating...
Strassen HBC - Pass 01 - Completed: 261.70301 secs
Strassen HBC - Pass 02 - Completed: 23.73500 secs
Strassen HBC - Pass 03 - Completed: 23.68700 secs
Strassen HBC - Pass 04 - Completed: 23.68800 secs
Strassen HBC - Pass 05 - Completed: 23.64000 secs
ALGORITHM_STRASSENHBC - 1 - Passed
...
SergeyKostrov
Valued Contributor II
1,018 Views
Here is an example where vectorization combined with software pipelining improves performance by ~1.5 percent:

[ MinGW C++ compiler version 4.8.1 Release 4 - Vectorized ]
...
Matrix Size: 10240 x 10240
Processing... ( Add - 1D-based )
_TMatrixSetF::Add - Pass 01 - Completed: 785.25000 ticks
_TMatrixSetF::Add - Pass 02 - Completed: 781.25000 ticks
_TMatrixSetF::Add - Pass 03 - Completed: 781.25000 ticks
_TMatrixSetF::Add - Pass 04 - Completed: 781.25000 ticks
_TMatrixSetF::Add - Pass 05 - Completed: 781.50000 ticks
Add - 1D-based - Passed
Processing... ( Sub - 1D-based )
_TMatrixSetF::Sub - Pass 01 - Completed: 781.25000 ticks
_TMatrixSetF::Sub - Pass 02 - Completed: 781.25000 ticks
_TMatrixSetF::Sub - Pass 03 - Completed: 781.50000 ticks
_TMatrixSetF::Sub - Pass 04 - Completed: 781.25000 ticks
_TMatrixSetF::Sub - Pass 05 - Completed: 781.25000 ticks
Sub - 1D-based - Passed
...

[ MinGW C++ compiler version 4.8.1 Release 4 - Vectorized and Software Pipelined ]
...
Matrix Size: 10240 x 10240
Processing... ( Add - 1D-based )
_TMatrixSetF::Add - Pass 01 - Completed: 777.25000 ticks
_TMatrixSetF::Add - Pass 02 - Completed: 777.50000 ticks
_TMatrixSetF::Add - Pass 03 - Completed: 773.50000 ticks
_TMatrixSetF::Add - Pass 04 - Completed: 777.25000 ticks
_TMatrixSetF::Add - Pass 05 - Completed: 773.50000 ticks
Add - 1D-based - Passed
Processing... ( Sub - 1D-based )
_TMatrixSetF::Sub - Pass 01 - Completed: 769.50000 ticks
_TMatrixSetF::Sub - Pass 02 - Completed: 769.50000 ticks
_TMatrixSetF::Sub - Pass 03 - Completed: 769.50000 ticks
_TMatrixSetF::Sub - Pass 04 - Completed: 773.50000 ticks
_TMatrixSetF::Sub - Pass 05 - Completed: 769.50000 ticks
Sub - 1D-based - Passed
...
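The manual software pipelining referred to above can be sketched minimally as follows (hypothetical functions, not the actual benchmark code): the loads for iteration i+1 are issued before the arithmetic and store of iteration i complete, giving the out-of-order core more independent work in flight.

```c
#include <stddef.h>

/* Rolled baseline: c[i] = a[i] + b[i]. */
void add_plain(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

/* Manually software-pipelined version: each iteration starts the
   loads for the next iteration before storing the current result.
   Assumes n >= 1. */
void add_pipelined(const float *a, const float *b, float *c, size_t n)
{
    float x = a[0], y = b[0];
    size_t i;
    for (i = 0; i + 1 < n; ++i) {
        float xn = a[i + 1];   /* next iteration's loads, issued early */
        float yn = b[i + 1];
        c[i] = x + y;          /* current iteration's add and store   */
        x = xn;
        y = yn;
    }
    c[i] = x + y;              /* drain the pipeline */
}
```

In practice an optimizing compiler may already perform this transformation, which is why the measured gain is small.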
SergeyKostrov
Valued Contributor II
1,018 Views
Here is an example where software pipelining improves performance by ~7.2 percent with a legacy Borland C++ compiler version 5.5.1:

[ Borland C++ compiler version 5.5.1 - Unrolled Loops 8-in-1 ]
...
Matrix Size: 10240 x 10240
Processing... ( Add - 1D-based )
_TMatrixSetF::Add - Pass 01 - Completed: 976.50000 ticks
_TMatrixSetF::Add - Pass 02 - Completed: 976.75000 ticks
_TMatrixSetF::Add - Pass 03 - Completed: 980.50000 ticks
_TMatrixSetF::Add - Pass 04 - Completed: 976.50000 ticks
_TMatrixSetF::Add - Pass 05 - Completed: 976.50000 ticks
Add - 1D-based - Passed
Processing... ( Sub - 1D-based )
_TMatrixSetF::Sub - Pass 01 - Completed: 976.50000 ticks
_TMatrixSetF::Sub - Pass 02 - Completed: 976.50000 ticks
_TMatrixSetF::Sub - Pass 03 - Completed: 976.50000 ticks
_TMatrixSetF::Sub - Pass 04 - Completed: 976.75000 ticks
_TMatrixSetF::Sub - Pass 05 - Completed: 976.50000 ticks
Sub - 1D-based - Passed
...

[ Borland C++ compiler version 5.5.1 - Software Pipelined and Rolled Loops ]
...
Matrix Size: 10240 x 10240
Processing... ( Add - 1D-based )
_TMatrixSetF::Add - Pass 01 - Completed: 910.25000 ticks
_TMatrixSetF::Add - Pass 02 - Completed: 910.25000 ticks
_TMatrixSetF::Add - Pass 03 - Completed: 910.00000 ticks
_TMatrixSetF::Add - Pass 04 - Completed: 906.25000 ticks
_TMatrixSetF::Add - Pass 05 - Completed: 910.25000 ticks
Add - 1D-based - Passed
Processing... ( Sub - 1D-based )
_TMatrixSetF::Sub - Pass 01 - Completed: 914.00000 ticks
_TMatrixSetF::Sub - Pass 02 - Completed: 914.00000 ticks
_TMatrixSetF::Sub - Pass 03 - Completed: 914.00000 ticks
_TMatrixSetF::Sub - Pass 04 - Completed: 914.00000 ticks
_TMatrixSetF::Sub - Pass 05 - Completed: 918.00000 ticks
Sub - 1D-based - Passed
...

Note: Vectorization is not supported by that version of the compiler; it is too old.
SergeyKostrov
Valued Contributor II
1,018 Views
Also, boosting the process priority to High or Realtime will improve performance by ~1.5 percent ( applicable to code compiled with any C++ compiler ).
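On Windows this is done with SetPriorityClass( GetCurrentProcess(), HIGH_PRIORITY_CLASS ). A POSIX sketch of the equivalent knob is below (the wrapper name set_process_nice is made up for the example; note that raising priority above the default, i.e. a negative nice value, usually requires elevated privileges, while lowering it is always allowed):

```c
#include <sys/resource.h>  /* setpriority / getpriority */
#include <errno.h>

/* Set this process's nice value and return the value now in effect,
   or -1 on failure. Lower nice values mean higher priority. */
int set_process_nice(int nice_value)
{
    if (setpriority(PRIO_PROCESS, 0, nice_value) != 0)
        return -1;
    errno = 0;
    int v = getpriority(PRIO_PROCESS, 0);
    /* getpriority can legitimately return -1; check errno to be sure. */
    return (v == -1 && errno != 0) ? -1 : v;
}
```

For benchmarking, the point is simply to reduce interference from other runnable processes while the timed passes execute.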
Bernard
Valued Contributor I
1,018 Views

Sergey Kostrov wrote:

Also, a priority boost to High or Realtime will improve performance by ~1.5 percent ( applicable to codes compiled with any C++ compiler ):

Hi Sergey

Did you try disabling some hardware, like NICs, and rerunning your tests?

SergeyKostrov
Valued Contributor II
928 Views
>>...Did you try to disable some hardware like NIC's and rerun your tests?..

No, I did not disable any hardware and I'm not going to re-run these tests. However, I'm going to post another set of performance results some time later.
emmanuel_attia
Beginner
928 Views

It seems that Visual Studio has enough hints to put your "kernel" (YMMWORD [ecx]) right into the instruction as a memory operand, which means it knows that the [ecx] pointer is aligned. It is hard to say more without the source code. But I guess that on g++ it does an additional vmovups to load the kernel register; even worse, it might load from somewhere that is actually not aligned; and worse still, it may do this for every pack of pixels when it could do it once for the whole loop.
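To make that concrete with a sketch (hypothetical function names; the OP's kernel is not shown): with _mm256_load_ps the compiler knows the operand is 32-byte aligned and can fold it straight into the arithmetic instruction as a memory operand, whereas _mm256_loadu_ps generally costs a separate vmovups.

```c
#include <immintrin.h>

/* Aligned variant: in must be 32-byte aligned (e.g. from _mm_malloc).
   The load can be folded into vmulps as a memory operand. */
__attribute__((target("avx")))  /* lets this file build without -mavx */
void scale8_aligned(const float *in, float k, float *out)
{
    __m256 v = _mm256_mul_ps(_mm256_load_ps(in), _mm256_set1_ps(k));
    _mm256_storeu_ps(out, v);
}

/* Unaligned variant: works for any pointer, but typically needs an
   explicit vmovups before the multiply. */
__attribute__((target("avx")))
void scale8_unaligned(const float *in, float k, float *out)
{
    __m256 v = _mm256_mul_ps(_mm256_loadu_ps(in), _mm256_set1_ps(k));
    _mm256_storeu_ps(out, v);
}
```

Note also that an aligned load on data that is not actually 32-byte aligned faults, so the alignment has to be guaranteed by the allocator.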

Maybe Visual is being more aggressive with the inlining. Have you tried the kind of flags on g++ that force deep inlining (which is critical in a convolution algorithm if you wrote it as multiple functions / functors)?
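On the g++ side, the usual knobs are -O3 -finline-functions plus per-function attributes; a sketch of the attribute route (the helper names kernel_tap and apply_taps are made up for the example):

```c
#include <stddef.h>

/* Force the helper into every caller; in a convolution written as many
   small functions/functors, a missed inline here costs a function call
   per tap per sample. */
static inline __attribute__((always_inline))
float kernel_tap(float x, float c)
{
    return x * c;
}

/* flatten asks GCC to inline this function's entire call tree into it
   where possible. */
__attribute__((flatten))
float apply_taps(const float *x, const float *c, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += kernel_tap(x[i], c[i]);
    return s;
}
```

Both always_inline and flatten are standard GCC function attributes; the equivalent MSVC spelling is __forceinline.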

Also, is it a good idea to use flags like -mtune=corei7-avx, which might perform optimizations that counter yours?
