How to work with AVX on Windows

siva_rama_k_ · ‎03-01-2015

Hi,

I am interested in using the AVX instruction set in my application to speed it up, but I am new to AVX.

How can I find out whether my system's processor supports AVX?

My system configuration is:

OS: Windows 7, 64-bit

CPU: Intel(R) Xeon(R) CPU W3505 @ 2.53GHz

Can anybody help me?

Thanks in advance.

Bernard · ‎03-02-2015

>>>How can I know whether my system processor is able to support AVX or not?>>>

The maximum ISA your CPU supports is SSE4.2. You can check your CPU spec here:

http://ark.intel.com/products/40800/Intel-Xeon-Processor-W3505-4M-Cache-2_53-GHz-4_80-GTs-Intel-QPI
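
If you would rather check from code than from the spec page, the CPUID instruction reports the same information. Below is a minimal sketch, assuming an x86 target and either MSVC (`__cpuid`, `_xgetbv`) or GCC/Clang (`__get_cpuid` from `<cpuid.h>`). The names `CpuFeatures`, `decode_features` and `detect_features` are made up for this illustration. Note that AVX needs more than the CPU flag: the operating system must also save the YMM registers, which is why the sketch also looks at the OSXSAVE bit and at XCR0.

```cpp
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>      // __cpuid
#include <immintrin.h>   // _xgetbv
#else
#include <cpuid.h>       // __get_cpuid (GCC/Clang)
#endif

// Hypothetical result type for this sketch.
struct CpuFeatures {
    bool sse42;  // CPUID leaf 1, ECX bit 20
    bool avx;    // CPUID leaf 1, ECX bits 27 and 28, plus XCR0 bits 1 and 2
};

// Pure decoding step: interpret ECX of CPUID leaf 1 and the value of XCR0.
inline CpuFeatures decode_features(std::uint32_t ecx, std::uint64_t xcr0) {
    const bool sse42   = ((ecx >> 20) & 1u) != 0;
    const bool osxsave = ((ecx >> 27) & 1u) != 0;  // OS has enabled XSAVE/XGETBV
    const bool avx_hw  = ((ecx >> 28) & 1u) != 0;  // CPU implements AVX
    const bool ymm_os  = (xcr0 & 0x6u) == 0x6u;    // OS saves XMM and YMM state
    return CpuFeatures{sse42, osxsave && avx_hw && ymm_os};
}

// Query the CPU this program is running on.
inline CpuFeatures detect_features() {
    std::uint32_t ecx = 0;
#if defined(_MSC_VER)
    int regs[4];
    __cpuid(regs, 1);
    ecx = static_cast<std::uint32_t>(regs[2]);
#else
    unsigned int eax, ebx, ecx_raw, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx_raw, &edx))
        return CpuFeatures{false, false};
    ecx = ecx_raw;
#endif
    std::uint64_t xcr0 = 0;
    if ((ecx >> 27) & 1u) {  // XGETBV is only legal when OSXSAVE is set
#if defined(_MSC_VER)
        xcr0 = _xgetbv(0);
#else
        std::uint32_t lo, hi;
        __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
        xcr0 = (static_cast<std::uint64_t>(hi) << 32) | lo;
#endif
    }
    return decode_features(ecx, xcr0);
}
```

Calling `detect_features()` from `main` and printing the two fields is enough to use it. On the W3505 discussed here it should report SSE4.2 but not AVX.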

Vladimir_Sedach · ‎03-02-2015

As Iliya said, your CPU supports only up to SSE4.2.

There is a (limited) workaround, though:
http://codeforces.com/problemset/customtest
(you need to register)
On this site you can compile and run your C... code as if it were on your own machine.

I'd appreciate it if you or anybody else could point me to other similar resources with at least AVX capability.

siva_rama_k_ · ‎03-02-2015

Thanks iliyapolak and Vladimir Sedach for your thoughts.

Can you suggest any material for getting familiar with the SSE4.2 instruction set?

Thanks in advance.

Vladimir_Sedach · ‎03-03-2015

All Intel C intrinsics from SSE through AVX2 are documented here:

http://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/index.htm#GUID-27EA00B6-F15F-4EC5-80EA-AFA553204C41.htm

siva_rama_k_ · ‎03-30-2015

Hi,

How do I compile SSE4.2 instructions on my Windows 7 machine, whose CPU supports the SSE4.2 instruction set?

Is it possible to compile such code from IDEs like Microsoft Visual Studio or Code::Blocks, using compilers like GNU?

I read that some flags have to be updated, but where can I update these flags?

Can anyone guide me?

Thanks in advance.

Regards

Siva Rama Krishna

TimP · ‎03-30-2015

With MSVC you can use SSE4 instructions only through intrinsics. Intel C++ works in Visual Studio. Arch=sse4.1 often performs better than 4.2.

GCC comes in several versions and does support SSE4 through -msse4 (also -msse4.1 and -msse4.2).
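
To see what a given set of flags actually enabled, you can test the compiler's predefined macros. The sketch below assumes the usual macro names: GCC and Clang define `__SSE4_1__`, `__SSE4_2__`, `__AVX__` and `__AVX2__` when built with `-msse4.1`, `-msse4.2`, `-mavx` or `-mavx2`. MSVC defines `__AVX__` and `__AVX2__` under `/arch:AVX` and `/arch:AVX2`, but has no SSE4 macro. The function name `compiled_isa_level` is made up for this illustration.

```cpp
#include <string>

// Report the highest vector ISA the compiler was allowed to target,
// judged only from predefined macros (a compile-time property, not a
// statement about the CPU the program later runs on).
std::string compiled_isa_level() {
#if defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE4_2__)
    return "SSE4.2";
#elif defined(__SSE4_1__)
    return "SSE4.1";
#elif defined(__SSE2__) || defined(_M_X64) || \
      (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
    return "SSE2";
#else
    return "baseline";
#endif
}
```

With MSVC this never returns "SSE4.1" or "SSE4.2", even though the SSE4 intrinsics are usable there. That is why code which wants to select an SSE4 path on MSVC has to fall back on some other macro.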

Bernard · ‎03-31-2015

You can also use SSE4 instructions in inline assembly, but then you are restricted to 32-bit code when compiling with the MSVC compiler, since MSVC does not support inline assembly for x64 targets.

siva_rama_k_ · ‎03-31-2015

Thanks Tim Prince and iliyapolak for your inputs.

Can you please point me to a document with step-by-step instructions for compiling code (a basic example) that uses SSE4.2 in MS VS?

OS: Windows 7 64-bit

Microsoft Visual Studio (MS VS): 12

Bernard · ‎03-31-2015

You need to create a project in VS, then include the header file relevant to SSE4. You also need to choose the proper CPU architecture in the project properties.
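
As a basic example of those steps, here is a minimal sketch that uses one SSE4.2 intrinsic, `_mm_crc32_u8` from `<nmmintrin.h>`, to compute a CRC-32C checksum, next to a plain bitwise version for comparison. The function names and the `SSE42_TARGET` macro are made up for this illustration. MSVC should accept the intrinsic without any extra switch. With GCC or Clang, either pass `-msse4.2` or mark the function with the target attribute as done here. The program will fault on a CPU without SSE4.2.

```cpp
#include <cstddef>
#include <cstdint>
#include <nmmintrin.h>   // SSE4.2 intrinsics such as _mm_crc32_u8

// Hypothetical helper: lets GCC/Clang compile the one function below for
// SSE4.2 even when -msse4.2 was not given on the command line.
#if defined(__GNUC__) && !defined(__SSE4_2__)
#define SSE42_TARGET __attribute__((target("sse4.2")))
#else
#define SSE42_TARGET
#endif

// CRC-32C (Castagnoli) of a byte buffer using the SSE4.2 crc32 instruction.
SSE42_TARGET
std::uint32_t crc32c_sse42(const unsigned char* data, std::size_t len) {
    std::uint32_t crc = 0xFFFFFFFFu;               // conventional initial value
    for (std::size_t i = 0; i < len; ++i)
        crc = _mm_crc32_u8(crc, data[i]);          // one byte per instruction
    return crc ^ 0xFFFFFFFFu;                      // conventional final XOR
}

// Portable bitwise reference using the reflected polynomial 0x82F63B78.
std::uint32_t crc32c_reference(const unsigned char* data, std::size_t len) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}
```

Both functions should return 0xE3069283 for the nine ASCII bytes "123456789", which is the standard check value for CRC-32C.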

TimP · ‎03-31-2015

Here's my example which works with msvc and gcc, where the flavor of intrinsics extension is chosen according to pre-definitions of various compilers. Note that MSVC doesn't have a pre-defined macro for the purpose of allowing choice of SSE4. Although the intrinsics in the SSE4 section are available on earlier CPUs, they didn't work well enough to use in this context until SSE4 CPUs became available. Likewise, the AVX code would need more work to run well on the early AVX platforms. GL- is set to permit mixing with Intel compiled objects. The plain C code at the end optimizes well with current gcc, so one may wonder why go to all this trouble to get optimization with the other compilers:

$ make -nf Makefile.windows loopstlm.obj
cl /Ox /EHsc /GL- /openmp /fp:fast -Zi /arch:SSE2 /Qvec-report:1 -c loopstl.cpp > loopstlm.txt 2>&1

for (nl = 1; nl <= i__1; ++nl) {
    // loop must vectorize backwards on account of data overlap
#if defined __AVX2__ // 256-bit unaligned is slow until corei7-4
    // scalar loop to adjust to aligned destination
    for (i__ = *n - 1; (((size_t)&a[i__+1] & 31) < 24); --i__)
        a[i__ + 1] = a[i__] + b[i__];
    // loop on parallel instructions while blocks of 8 remain
    for (; i__ >= 9; i__ -= 8) {
        __m256 tmp1 = _mm256_loadu_ps(&a[i__ - 7]),
               tmp2 = _mm256_loadu_ps(&b[i__ - 7]);
        _mm256_store_ps(&a[i__ - 6], _mm256_add_ps(tmp1, tmp2));
    }
    // scalar loop to finish up remainder
    for (; i__ >= 1; --i__)
        a[i__ + 1] = a[i__] + b[i__];
#else
#if defined __SSE4_1__ || defined _M_IX86_FP // early loadu_ps was inefficient
    // scalar loop to adjust to aligned destination
    for (i__ = *n - 1; (((size_t)&a[i__+1] & 15) < 12); --i__)
        a[i__ + 1] = a[i__] + b[i__];
    // loop on parallel instructions while blocks of 4 remain
    for (; i__ >= 5; i__ -= 4) {
        __m128 tmp1 = _mm_loadu_ps(&a[i__ - 3]),
               tmp2 = _mm_loadu_ps(&b[i__ - 3]);
        _mm_store_ps(&a[i__ - 2], _mm_add_ps(tmp1, tmp2));
    }
    // scalar loop to finish up 3 trip remainder
    // even if there were repeats, this would be superior to loop with
    // optimization for larger trip counts
    for (i__ = 3; i__ >= 1; --i__)
        a[i__ + 1] = a[i__] + b[i__];
#else
#ifdef __INTEL_COMPILER
    a[*n:*n-1:-1] = a[*n-1:*n-1:-1] + b[*n-1:*n-1:-1];
#else
#ifndef __SUNPRO_CC
#warning "SSE4 unseen, dropping to C source"
#endif
    for (i__ = *n - 1; i__ >= 1; --i__)
        a[i__ + 1] = a[i__] + b[i__];
#endif
#endif
#endif
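
For anyone who wants to try the idea in isolation, here is a self-contained sketch of the same backward loop, a[i+1] = a[i] + b[i]. It is not the code above. It uses 0-based indexing, plain SSE intrinsics from `<xmmintrin.h>` (baseline on x86-64, so no extra flag should be needed), and unaligned loads and stores instead of the scalar alignment loop. The function names are made up for this illustration.

```cpp
#include <cstddef>
#include <vector>
#include <xmmintrin.h>   // SSE: __m128, _mm_loadu_ps, _mm_storeu_ps, _mm_add_ps

// Scalar reference: runs backwards because each a[i + 1] overwrites an
// element that a forward loop would still need as an input.
void shift_add_scalar(float* a, const float* b, std::ptrdiff_t n) {
    for (std::ptrdiff_t i = n - 2; i >= 0; --i)
        a[i + 1] = a[i] + b[i];
}

// SSE version: walks backwards in blocks of 4. Each block loads its inputs
// a[i-3..i] before storing to a[i-2..i+1], and lower blocks only read
// elements below i-3, so the overlap is harmless.
void shift_add_sse(float* a, const float* b, std::ptrdiff_t n) {
    std::ptrdiff_t i = n - 2;                            // highest source index
    for (; i >= 3; i -= 4) {
        __m128 va = _mm_loadu_ps(&a[i - 3]);             // a[i-3..i]
        __m128 vb = _mm_loadu_ps(&b[i - 3]);             // b[i-3..i]
        _mm_storeu_ps(&a[i - 2], _mm_add_ps(va, vb));    // a[i-2..i+1]
    }
    for (; i >= 0; --i)                                  // scalar remainder
        a[i + 1] = a[i] + b[i];
}

// Small self-check: run both versions on the same data and compare.
bool shift_add_matches(std::ptrdiff_t n) {
    std::vector<float> a1(static_cast<std::size_t>(n));
    std::vector<float> b(static_cast<std::size_t>(n));
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        a1[static_cast<std::size_t>(i)] = static_cast<float>(i);
        b[static_cast<std::size_t>(i)]  = static_cast<float>(2 * i + 1);
    }
    std::vector<float> a2 = a1;
    shift_add_scalar(a1.data(), b.data(), n);
    shift_add_sse(a2.data(), b.data(), n);
    return a1 == a2;
}
```

Calling `shift_add_matches` with a few sizes, including ones that are not a multiple of 4, is a quick way to confirm that the block loop and the remainder loop together cover every element.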