clBuildProgram with AVX

nurbs · ‎12-06-2011

Is it possible to compile an OpenCL kernel with clBuildProgram using the AVX extension? Do I pass an option string to the function? And what should the string look like?

Jim_Vaughn · ‎12-13-2011

Hi Nurbs,

The compiler should optimize for whatever hardware you have on the compilation machine: SSE 4.1, SSE 4.2, or AVX. If you want to compile for AVX on a machine which does not have AVX instructions, you should be able to use the Intel offline compiler. The Intel OpenCL SDK User Guide discusses how to set the compiler to use either SSE 4.1, SSE 4.2, or AVX. Here is the link to the document, followed by the relevant information from section 5.2.3.

http://software.intel.com/file/39188

Choosing a Different Target Instruction Set Architecture

The Intel OpenCL SDK Offline Compiler tool enables you to choose the target instruction set architecture when building an OpenCL code. It enables you to see the assembly code of different instruction set architectures and to generate program binaries for different hardware platforms.

1. Select Tools > Options...

2. In the Options window, under the Instruction Set Architecture group box uncheck the Use current platform architecture checkbox and select the appropriate ISA from the combo box below. The available items are:

o Streaming SIMD Extension 4.1 (SSE4.1)

o Streaming SIMD Extension 4.2 (SSE4.2)

o Advanced Vector Extension (AVX)

nurbs · ‎12-14-2011

Thanks Jim. I am trying to justify my recommendation of using CPUs that support AVX. My thought now is that I will create binary files with different options, like SSE 4.1, SSE 4.2, AVX, etc., create each of these versions of the kernel using clCreateProgramWithBinary(), and then compare the performance results. Does that make sense?

NURBS
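A minimal sketch of the loading step in the plan above, assuming each ISA variant has already been saved to disk by the offline compiler. The helper name `read_binary_file` and any file names are hypothetical, not part of the Intel SDK; the OpenCL calls appear only in a comment because they need a context and device that are not set up here.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read a whole file into a malloc'd buffer.
 * Returns NULL on failure. On success *size_out holds the byte count
 * and the caller must free() the buffer. */
unsigned char *read_binary_file(const char *path, size_t *size_out)
{
    assert(path != NULL && size_out != NULL);

    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return NULL;

    /* Find the file size by seeking to the end. */
    if (fseek(f, 0, SEEK_END) != 0) { fclose(f); return NULL; }
    long len = ftell(f);
    if (len < 0) { fclose(f); return NULL; }
    rewind(f);

    /* Allocate at least one byte so an empty file still yields a buffer. */
    unsigned char *buf = malloc(len > 0 ? (size_t)len : 1);
    if (buf == NULL) { fclose(f); return NULL; }

    size_t got = fread(buf, 1, (size_t)len, f);
    fclose(f);
    if (got != (size_t)len) { free(buf); return NULL; }

    *size_out = (size_t)len;
    return buf;
}

/* With a cl_context `ctx` and cl_device_id `device` already created
 * (not shown), the buffer would be handed to the runtime roughly so:
 *
 *   const unsigned char *bins[] = { buf };
 *   size_t lens[] = { size };
 *   cl_int status, err;
 *   cl_program prog = clCreateProgramWithBinary(ctx, 1, &device,
 *                                               lens, bins, &status, &err);
 *   err = clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
 */
```

A program created from a binary still has to go through clBuildProgram before kernels can be created from it, so each variant would be loaded, built, and timed in turn.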

Jim_Vaughn · ‎12-15-2011

PLEASE NOTE: I want to stress that I am talking somewhat above my understanding of the SSE and AVX instruction sets.

I am pretty sure that just running the same kernel compiled for both instruction sets isn't going to give you a true qualitative review to justify one over the other. For starters, Intel (and AMD and NVIDIA, as of recent betas) are moving to LLVM to optimize your code compilation at runtime (a gross oversimplification). Because of this, they might take your SSE 4.1 compiled code, realize it is running on AVX hardware, and optimize it to take full advantage of the AVX hardware. I would be very surprised if they didn't. That is not to say there will not be a performance delta, as I am sure there will be, but it will likely not be a true test of an identical CPU without AVX. Also, Intel's OpenCL implementation doesn't just use the SSE and AVX instruction sets; it uses the CPU cores as well. As far as I know, Intel doesn't make two CPUs which are identical except that one has AVX and one doesn't. So to have a completely accurate test, you might need to rephrase the question you are trying to answer in order to reach your stated conclusion of justifying a processor with AVX over one without it.

Also, for true performance on any of the instruction sets you are most likely going to need to tweak your kernel. If you use the same kernel, it might be taking full advantage of the SSE 4.1 instruction set, but when run on AVX it is the same. You will probably want to develop three kernels which solve the same problem, each with code optimized for the hardware it is running on. Sometimes with vector hardware (even more so with GPUs) the last 10% of optimization can result in gains ranging from all but trivial to very significant, depending on a number of factors.

Now, I am sure you are justifying this to someone who isn't interested in, or truly doesn't need to see, that level of justification. First, you may contact Intel, as they probably have some report on the performance differential between the two instruction sets for various tasks, and they may even have some test code you can run to demonstrate this fact. I have no doubt Intel wants you to succeed in your defense of the quality of their newer processors, so you should try to contact them directly (maybe that is the purpose of your post, so I may be recommending what you are already doing).

Well, I have talked long enough and bored you to death, but I wanted to help as I have done a number of these justifications. In short: no, that isn't a successful demonstration of the performance differential you should expect, but you should try it on a few different kernels. Take the examples, compile them differently, and see what performance difference you get. It may be that the performance differential is enough to justify your decision. If so, great, you saved tons of time and effort. If not, get two systems, one with AVX and one with SSE 4.1, and show the difference (I would imagine the improvements across the board between the two processors will make this a no-brainer in terms of increased performance, but it may require a board change to your design/system, and a redesign always adds significant cost). If that isn't enough, you can then write custom kernels optimized for each platform and instruction set and get something even more accurate. I can tell you from experience that we did this recently and the performance difference was significant (not orders of magnitude, like on the GPU), but it was considerable and we were actually able to reduce the number of CPUs in a design.

Hope that helps!

Jim
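The "compile them differently and see what performance difference you get" step can be summarized with a small helper like the one below. This is only a sketch under assumptions: the function names `median_ms` and `speedup` are hypothetical, and the timings are assumed to have been collected elsewhere, for example from clGetEventProfilingInfo with CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END, over several runs per build. The median is used here simply because it is less sensitive to a stray slow run than the mean.

```c
#include <assert.h>
#include <stdlib.h>

/* Comparison callback for qsort over doubles. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Median of n timings in milliseconds (sorts the array in place). */
double median_ms(double *t, size_t n)
{
    assert(t != NULL && n > 0);
    qsort(t, n, sizeof t[0], cmp_double);
    return (n % 2) ? t[n / 2] : 0.5 * (t[n / 2 - 1] + t[n / 2]);
}

/* Speedup of a candidate build over a baseline build.
 * A value above 1.0 means the candidate ran faster. */
double speedup(double baseline_ms, double candidate_ms)
{
    assert(candidate_ms > 0.0);
    return baseline_ms / candidate_ms;
}
```

For example, with the SSE 4.1 build as the baseline, `speedup(median_ms(sse_runs, n), median_ms(avx_runs, n))` gives a single ratio per kernel. As noted in the thread, a ratio measured on one AVX-capable machine is not the same thing as comparing two different processors.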