Mixing SSE and AVX inside an application

michaelnikelsky1 · ‎01-16-2011

Hi,

I am currently in the process of adding AVX support to my application.

While the floating-point AVX port looks quite simple, the integer port does not, since there are no 256-bit integer AVX instructions ( :( ). So I need to emulate those with two AVX-128 instructions. However, there seem to be no AVX-128 intrinsics (at least I couldn't find any). But since there is a big penalty for switching between SSE and AVX, I need the compiler to generate the AVX-128 integer instructions.

I know about the AVX compiler flag, but it is out of the question since the SSE code needs to stay intact so I can still run the software on platforms that don't support AVX. So the idea is to have two code paths and a branch somewhere to choose the code path that fits the CPU.
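A minimal sketch of such a branch, assuming the two variants of a kernel live in separate namespaces (in a real build they would come from separately compiled translation units, libraries, or DLLs). The names `sse_path`, `avx_path`, `cpu_has_avx` and `select_sum` are hypothetical, and the kernel bodies are plain-loop stand-ins. The check tests both the CPUID AVX bit and OS support for saving the YMM state; the MSVC branch assumes `_xgetbv` is available, which to my knowledge requires VC2010 SP1 or later.

```cpp
#include <cstddef>
#if defined(_MSC_VER)
#include <intrin.h>     // __cpuid
#include <immintrin.h>  // _xgetbv (assumed available: VC2010 SP1 or later)
#endif

// Hypothetical stand-ins for the same kernel built once per instruction set.
namespace sse_path {
inline float sum(const float* p, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) s += p[i];
    return s;
}
}
namespace avx_path {
inline float sum(const float* p, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) s += p[i];
    return s;
}
}

// True only if the CPU reports AVX and the OS saves/restores the YMM registers.
inline bool cpu_has_avx() {
#if defined(_MSC_VER)
    int info[4];
    __cpuid(info, 1);
    const bool osxsave = (info[2] & (1 << 27)) != 0;
    const bool avx     = (info[2] & (1 << 28)) != 0;
    return osxsave && avx && ((_xgetbv(0) & 6) == 6);
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("avx") != 0;
#else
    return false;  // conservative fallback: stay on the SSE path
#endif
}

typedef float (*SumFn)(const float*, std::size_t);

// Decide once (for example at startup) and call through the pointer afterwards.
inline SumFn select_sum() {
    return cpu_has_avx() ? &avx_path::sum : &sse_path::sum;
}
```

Calling through a function pointer prevents inlining across the branch, so the switch usually has to sit at a coarse level (a whole render call rather than a single vector operation).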

So what am I supposed to do to get the compiler to generate AVX-128 instructions in one place and SSE instructions in another for the same source file? Why aren't there any AVX-128 intrinsics?

By the way, I am using VC2010 at the moment; using the Intel compiler would be a lot of work (I tried it and there were quite a few problems where it couldn't compile the code, so that pretty much rules itself out as well, although it might be a last resort).

Any hint would be great.
Michael

Brijender_B_Intel · ‎01-16-2011

Hi Michael,
The compiler will generate AVX (128-bit) code based on the switches used. The same 128-bit intrinsics can be used for SSE and AVX. When the compiler sees the /arch:AVX switch, it will generate AVX code for those intrinsics. However, you can still use 256-bit AVX instructions if you convert the integers to float before processing and back before writing them.
Please use VC2010 SP1 if you have access to MSDN.

michaelnikelsky1 · ‎01-17-2011

Yes, I know about the compiler flag (and I am already using the VC2010 SP1 Beta). But that is exactly the point that sucks: a binary built with it cannot run on older hardware, which is an absolute showstopper for any existing application in my opinion, so it is totally out of the question. I don't understand why there are no intrinsics to explicitly access the AVX-128 instructions; in my opinion this will prevent many developers of commercial applications from using AVX at all.

Conversion to float isn't an option either since it loses too much accuracy (23 bits won't work for us), and double precision destroys all the benefits of using SIMD at all (fun fact: I implemented two versions of an _mm_div_epi32 function, one which calls 4 normal integer divides on the components of an SSE vector, the other which converts an int vector to two double vectors, does two divides and converts the result back. Calling 4 normal divides was about 20-30% faster, at least on a Conroe-based system).
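For reference, the two variants described above might look roughly like this with SSE2 only. This is a sketch, not the poster's actual code: the names `div_epi32_scalar` and `div_epi32_via_double` are hypothetical, and division by zero is not handled in either version. Doubles carry 53 mantissa bits, so truncating the double quotient of two int32 values gives the same result as integer division.

```cpp
#include <emmintrin.h>  // SSE2

// Variant 1: four scalar divides on the extracted components.
inline __m128i div_epi32_scalar(__m128i a, __m128i b) {
    int x[4], y[4];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(x), a);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(y), b);
    for (int i = 0; i < 4; ++i) x[i] /= y[i];  // truncates toward zero
    return _mm_loadu_si128(reinterpret_cast<const __m128i*>(x));
}

// Variant 2: widen to two double vectors, divide, truncate back to int32.
inline __m128i div_epi32_via_double(__m128i a, __m128i b) {
    // Lanes 0 and 1: cvtepi32_pd converts the two low int32 lanes.
    __m128d qlo = _mm_div_pd(_mm_cvtepi32_pd(a), _mm_cvtepi32_pd(b));
    // Move lanes 2 and 3 down to positions 0 and 1, then repeat.
    __m128i ah = _mm_unpackhi_epi64(a, a);
    __m128i bh = _mm_unpackhi_epi64(b, b);
    __m128d qhi = _mm_div_pd(_mm_cvtepi32_pd(ah), _mm_cvtepi32_pd(bh));
    // cvttpd_epi32 truncates toward zero and zeroes the upper 64 bits,
    // so the two halves can be merged with a 64-bit unpack.
    return _mm_unpacklo_epi64(_mm_cvttpd_epi32(qlo), _mm_cvttpd_epi32(qhi));
}
```

Which variant is faster depends on the CPU; the 20-30% figure above is the poster's measurement on one system, not something this sketch reproduces.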

The only option I currently see is to have two different DLLs and decide at application start which DLL to load. So you will at least need a build system that can compile the same code twice (or even more often) with different compiler settings - not really fun to set up.

bronxzv · ‎01-17-2011

instead of 2 DLLs you can also use 2 static libraries generated from the same source (legacy SSE intrinsics), one with the AVX flag. To avoid identifier clashes, C++ namespaces come in handy since you can use the name mangling to automatically add a prefix such as "SSE::" and "AVX::"
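A rough sketch of the namespace trick, assuming a hypothetical macro `SIMD_NS` that the build system sets per library (for example `SIMD_NS=AVX` together with the AVX flag). The function `hsum4` is only an illustration. Because the namespace becomes part of the mangled name, both libraries can be linked into one executable without symbol clashes.

```cpp
#include <xmmintrin.h>  // legacy SSE intrinsics, the same source for both builds

// Hypothetical macro: the build system would pass SIMD_NS=AVX for the
// library compiled with the AVX flag and SIMD_NS=SSE for the other one.
#ifndef SIMD_NS
#define SIMD_NS SSE
#endif

namespace SIMD_NS {

// Horizontal sum of four floats.
inline float hsum4(const float* p) {
    __m128 v = _mm_loadu_ps(p);                   // [a, b, c, d]
    __m128 t = _mm_add_ps(v, _mm_movehl_ps(v, v));  // [a+c, b+d, ...]
    t = _mm_add_ss(t, _mm_shuffle_ps(t, t, 1));   // lane 0: a+c+b+d
    return _mm_cvtss_f32(t);
}

}  // namespace SIMD_NS
```

The caller would then pick `SSE::hsum4` or `AVX::hsum4` at run time after checking the CPU.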

michaelnikelsky1 · ‎01-19-2011

Ok, that might work as well. But I think I will stick to the two (or more) DLL variant. It requires quite some code restructuring on my side but seems to be the best and probably the cleanest option. I just hope the speed gain from AVX will be worth the effort.

bronxzv · ‎01-19-2011

recompiling 128-bit code just for the 3-operand AVX instructions provides nearly no speedup in real-world use cases (less than 1.05x actual speedup, from hands-on experience)

though if you have significant code portions amenable to 256-bit fp code it will be more worth the effort. From my experiments you easily get 1.2x to 1.3x speedups vs SSE for the cases with a normal share of loads/stores; the best speedup I have measured so far is 1.82x for a case with fewer loads/stores than average, i.e. most computations with 3 register operands, so you can expect more gains overall in 64-bit mode since you have more registers available for your constants

michaelnikelsky1 · ‎01-21-2011

??? So....what's your point? I am not talking about the 3-operand AVX instructions; actually, when I added the few SSE4.2 instructions I got about 1 percent performance increase at most - so I couldn't care less about these 3-operand instructions.
If I had a choice of what I would like to see in AVX it would be quite a lot simpler:
All basic math ( +,-,*,/), logic (and, or, andnot, xor) and comparison ( <=, <, !=, ==, >, >=) operations implemented for all of int32, int64, float, and double in a way that makes them equal to standard floating-point math.
I don't need instructions like blendvps when I can do a simple or(andnot, and) for the same result. But missing a normal integer multiply and divide just sucks. And since it took until SSE4.1 to actually add a normal integer multiplication, I have very little hope for AVX on this point.
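The or(andnot, and) select mentioned here can be sketched as follows; `select_ps` is a hypothetical name. One difference from blendvps: blendvps looks only at the sign bit of each mask lane, while this version needs every bit of a lane set or cleared, which is what the SSE compare intrinsics produce.

```cpp
#include <xmmintrin.h>  // SSE

// Per lane: returns a where mask is all ones, b where mask is all zeros.
inline __m128 select_ps(__m128 mask, __m128 a, __m128 b) {
    // _mm_andnot_ps(mask, b) computes (~mask) & b.
    return _mm_or_ps(_mm_and_ps(mask, a), _mm_andnot_ps(mask, b));
}
```

A typical use is `select_ps(_mm_cmplt_ps(x, limit), x, fallback)`, which keeps `x` in the lanes where the comparison holds.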

About the significant code portion: Actually I am adding AVX support to a commercial SSE-optimized raytracer (look at www.pi-vr.de for more info). AVX support in our case means: instead of just tracing 4 rays at once, we can now trace 8 rays with AVX, so the speedup may be quite nice if everything works out, since I have, roughly speaking, 300 source files full of SSE code, summing up to many thousands of lines of pure SSE code.

Anyway, the separation of the whole code into a new DLL seems to mostly work already (I still need to fix some dependencies). So hopefully I will get some AVX results soon.

bronxzv · ‎01-21-2011

>??? So....what's your point?

eh Michael I'm just trying to help you get started, you were saying that

"
explicitly access the AVX 128 instructions, in my opinion this will prevent many developers of commercial applications to use AVX at all.
"

so I was thinking that you were planning to recompile your code just to get the non-destructive 3-operand form that AVX-128 offers for all instructions (except the 4-operand VBLENDVPS). If you plan for AVX-256 after all, you can hope for more than a 5% speedup, and in this case it will indeed be sensible to also compile the legacy 128-bit code for AVX-128 so that you have no transition penalty

>So hopefully I will get some AVX results soon.

I will be very interested to hear about the speedups you'll get, don't forget to post your findings here!
With our own realtime 3D renderer we are stuck at roughly 15% overall speedup with nearly all kernels in AVX-256 mode already, that's quite disappointing so far

michaelnikelsky1 · ‎01-22-2011

Ah, ok, sorry, I misunderstood your answer. I should have been more specific about what I am actually doing, I think.

The transition penalty is really what bothered me most, and that's why I wonder why there aren't any intrinsics to explicitly use AVX when you want AVX and SSE when you need SSE. But with two different DLLs I guess the compiler flag will take care of this now (I hope).

15% speedup indeed doesn't sound like that much, I hoped it would be at least 50%. I will post results once I have some, should be next week I hope.

bronxzv · ‎01-22-2011

>But with two different DLLs I guess the compiler flag will take care of this now (I hope).

at least with the Intel compiler it works well: when compiling legacy SSEn intrinsics with the "/QxAVX" flag it generates AVX-128 code that you can freely mix with AVX-256 code (using the 256-bit intrinsics) without any transition penalty

>15% speedup indeed doesn't sound like that much

yes, but that's the overall speedup with turbo enabled and 8 threads; we get a better speedup with a single thread and turbo off (i.e. our workloads are memory-bandwidth bound). Also, some individual kernels have 50% speedup or more (our best observed speedup is 82% for a loop), though a lot of loops are at 10% speedup or less. My understanding is that a key limiter to 128-bit to 256-bit scalability is the load bandwidth from the L1D cache: SSE can use 32B/cycle (2 loads per cycle), and AVX-256 can't do better than 32B/cycle, i.e. 1 load per cycle sustained. The L1D$ write bandwidth (16B/clock) is also a strong limiter for all the cases where you copy arrays or set buffers to a value; in these cases the speedup is roughly 0%

levicki · ‎01-24-2011

@bronxzv:
Looks like you have a problem with scaling to multiple cores. That is usually an issue with memory bandwidth and the data access pattern. You need to find a way to reduce the memory bandwidth requirement. Improving data locality is the best way to accomplish that.

@michael:
If you are already writing an AVX code path that uses the 256-bit AVX FPU, not using the 3-operand syntax in the rest of that code path doesn't make any sense.

When transitioning from AVX to legacy SSE you need to use vzeroupper if I remember correctly. Check the optimization reference manual for details on mixing AVX and SSE.
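A minimal sketch of where such a vzeroupper would go, using the `_mm256_zeroupper()` intrinsic. The function `add_arrays` is a hypothetical example, and the `__AVX__` guard is an assumption about the build: the 256-bit part is compiled only when the translation unit is built with AVX enabled, otherwise only the scalar loop remains.

```cpp
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical kernel: out[i] = a[i] + b[i].
void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
#if defined(__AVX__)
    // 256-bit part: 8 floats per iteration.
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }
    // Clear the upper halves of the YMM registers before control may reach
    // code that uses legacy (non-VEX) SSE encodings, e.g. in the caller.
    _mm256_zeroupper();
#endif
    // Scalar remainder (and the whole loop when AVX is not enabled).
    for (; i < n; ++i) out[i] = a[i] + b[i];
}
```

Compilers that build a whole translation unit for AVX often insert vzeroupper themselves at function boundaries, so the explicit call matters most when 256-bit code calls into separately compiled legacy SSE code.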

michaelnikelsky1 · ‎01-24-2011

>@michael:
>If you are already writing AVX code path that uses 256-bit AVX FPU, not using 3 operand syntax in the rest of that code path doesn't make any sense.

That's not what I meant or said. I just don't really care about these instructions, meaning I would be fine if they didn't exist, and I would even be very happy if instead something like _mm_div_epi32 and _mm_mullo_epi32 had been there from the very beginning. Having to emulate blendvps by using a series of &, | and so on is easy and simple (and not even really slower...), but just implementing a working _mm_mullo_epi32 was not so easy, and I am still missing an implementation of _mm_div_epi32 that is actually faster than just doing 4 simple divides.
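For readers without SSE4.1, one widely used SSE2 emulation of a 32-bit low multiply is sketched below. This is not necessarily the poster's implementation, and `mullo_epi32_sse2` is a hypothetical name. It relies on the fact that the low 32 bits of a product are the same for signed and unsigned operands, so the unsigned `_mm_mul_epu32` can be used.

```cpp
#include <emmintrin.h>  // SSE2

// Low 32 bits of the lane-wise product, like SSE4.1 _mm_mullo_epi32.
inline __m128i mullo_epi32_sse2(__m128i a, __m128i b) {
    // 64-bit products of lanes 0 and 2.
    __m128i even = _mm_mul_epu32(a, b);
    // Shift lanes 1 and 3 into the even positions and multiply those.
    __m128i odd = _mm_mul_epu32(_mm_srli_si128(a, 4), _mm_srli_si128(b, 4));
    // Gather the low 32 bits of each product into the two low lanes ...
    __m128i even_lo = _mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0));
    __m128i odd_lo  = _mm_shuffle_epi32(odd,  _MM_SHUFFLE(0, 0, 2, 0));
    // ... and interleave them back into the order p0, p1, p2, p3.
    return _mm_unpacklo_epi32(even_lo, odd_lo);
}
```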

But my main issue was that there is no way to explicitly use AVX-128 instructions with intrinsics, so the only way to make an application support both AVX and SSE is to essentially compile the whole application twice and more or less let the customer decide which application to start. Just using the SSE intrinsics is a bad idea, as we are told, but since AVX lacks integer support at the moment it is actually the only way to go.

So just adding AVX for a part where it makes sense and leaving the rest untouched is impossible. I had to restructure a lot of code just to be able to put all the relevant SSE/AVX code into a single DLL I can switch on startup. I am just glad it worked out at all in our case; I guess in many other cases this will just fail.

levicki · ‎01-24-2011

Let us leave the lack of those instructions aside for a moment.

You can write all your critical functions using AVX intrinsics in a single .cpp file which you compile with /QxAVX, and all your critical SSE functions in another .cpp file which you compile with /QxSSE2. Then you can use the Intel Compiler's CPU dispatching feature, which will select the proper function variant to call at runtime.

michaelnikelsky1 · ‎01-24-2011

I am using the VC2010 compiler, which doesn't know the dispatching feature. Also I am not sure how this feature behaves with inlined code (and inlining is essential for us as it pays off in frames per second). But since the Intel compiler wouldn't compile our application the last time we checked, failing with errors I couldn't really decipher (meaning: on code that compiles fine in MSVC and GCC and is simply consistent with the C++ standard), it is not really an option for us.

As I have written: I am talking about a realtime raytracing system inside a commercial application with about a million lines of code, where the raytracing part is 100% SSE code. This is not a toy application with a few expensive functions; actually we are trying to push the CPU to the limits here.

So you don't want to put, roughly speaking, 50000-100000 lines of SSE code into a single cpp file. In fact, you don't want to actually rewrite all that code for AVX at all but use templates for it. But then you can't, since you can't call the right instructions because the intrinsics are missing.

Anyway, we figured out a way to compile different DLLs for the different CPU platforms, and hopefully I will get the AVX port running within this week.

TimP · ‎01-24-2011

Quoting bronxzv

>But with two different DLLs I guess the compiler flag will take care of this now (I hope).

at least with the Intel compiler it works well: when compiling legacy SSEn intrinsics with the "/QxAVX" flag it generates AVX-128 code that you can freely mix with AVX-256 code (using the 256-bit intrinsics) without any transition penalty

>15% speedup indeed doesn't sound like that much

yes, but that's the overall speedup with turbo enabled and 8 threads; we get a better speedup with a single thread and turbo off (i.e. our workloads are memory-bandwidth bound). Also, some individual kernels have 50% speedup or more (our best observed speedup is 82% for a loop), though a lot of loops are at 10% speedup or less. My understanding is that a key limiter to 128-bit to 256-bit scalability is the load bandwidth from the L1D cache: SSE can use 32B/cycle (2 loads per cycle), and AVX-256 can't do better than 32B/cycle, i.e. 1 load per cycle sustained. The L1D$ write bandwidth (16B/clock) is also a strong limiter for all the cases where you copy arrays or set buffers to a value; in these cases the speedup is roughly 0%

In our experience, it was L2 which limited sustained load bandwidth (which is already greater on Sandy Bridge than could be attained on pre-AVX CPUs).
The lack of advantage for AVX on copy and memset operations was always a reasonably well documented "feature."

bronxzv · ‎01-24-2011

>So you don't want to put, roughly speaking, 50000-100000 lines of SSE code into a single cpp file. In fact, you don't want to actually rewrite all that code for AVX at all but use templates for it. But then you can't, since you can't call the right instructions because the intrinsics are missing.

I think our case is pretty much the same: we have roughly 70k lines of C++ code in around 120 .cpp files for the realtime 3D renderer and other performance-critical parts of our engine. Indeed it would be a very bad idea to put all of this code in the same file (!), and it makes no sense to duplicate all the code for each target path (it would be plain unmanageable). The best you can do IMO is to work at a higher level with wrapper classes around the intrinsics and inlined functions / operators, and simply have some specialized headers for each path (like SSE / SSE-2 / AVX / AVX with FMA3 / whatever). A sensible design methodology is to have a single source code repository using high-level team conventions (ISA-agnostic); each change is directly available for all the target paths and you have very low validation costs & delays. Adding a new path or tuning the primitives of existing paths is then 100% orthogonal to the main projects.

levicki · ‎01-24-2011

The Intel C++ compiler usually follows the C++ standard better than MSVC, so in most cases the problem is with the code.

Intel Compiler also has options to compile some dubious constructs that MSVC accepts by default.

There are also some compiler bugs with templates and advanced C++ features, but those are detected pretty fast and resolved in updates.

Regarding the size of your project and amount of SSE code in it -- it has always been an unwritten rule that ~10% of the code is responsible for ~90% of the performance, and raytracing is not an exception.

Writing almost everything with SSE manually doesn't make sense when a better compiler can do that for you automatically.

michaelnikelsky1 · ‎01-25-2011

I ended up having overloaded base floating-point and integer classes that I choose based on a compile-time define. Our build system is capable of building the same source with different defines, so once I have fixed all issues that come from going from 4 components to 8 components, further AVX extensions should be a matter of rewriting parts of the base classes. At least with SSE2/SSE4 switches this already works quite nicely.
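A rough sketch of what such a base class selected by a compile-time define could look like. The define `SIMD_USE_AVX`, the class `vfloat` and the kernel `madd` are hypothetical names, not the poster's code. The kernel is written once against the wrapper and picks up 4 or 8 components from whichever definition is active.

```cpp
#include <cstddef>

#if defined(SIMD_USE_AVX)  // hypothetical define set by the build system
#include <immintrin.h>
struct vfloat {
    __m256 v;
    enum { width = 8 };
    static vfloat load(const float* p) { vfloat r; r.v = _mm256_loadu_ps(p); return r; }
    void store(float* p) const { _mm256_storeu_ps(p, v); }
};
inline vfloat operator+(vfloat a, vfloat b) { vfloat r; r.v = _mm256_add_ps(a.v, b.v); return r; }
inline vfloat operator*(vfloat a, vfloat b) { vfloat r; r.v = _mm256_mul_ps(a.v, b.v); return r; }
#else
#include <xmmintrin.h>
struct vfloat {
    __m128 v;
    enum { width = 4 };
    static vfloat load(const float* p) { vfloat r; r.v = _mm_loadu_ps(p); return r; }
    void store(float* p) const { _mm_storeu_ps(p, v); }
};
inline vfloat operator+(vfloat a, vfloat b) { vfloat r; r.v = _mm_add_ps(a.v, b.v); return r; }
inline vfloat operator*(vfloat a, vfloat b) { vfloat r; r.v = _mm_mul_ps(a.v, b.v); return r; }
#endif

// ISA-agnostic kernel: out = a * b + c.
// Processes full vectors only; n is assumed to be a multiple of vfloat::width.
inline void madd(const float* a, const float* b, const float* c,
                 float* out, std::size_t n) {
    for (std::size_t i = 0; i + vfloat::width <= n; i += vfloat::width) {
        vfloat r = vfloat::load(a + i) * vfloat::load(b + i) + vfloat::load(c + i);
        r.store(out + i);
    }
}
```

Building the same file once with and once without the define (plus the matching compiler flag) gives the two code paths without duplicating the kernel source.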

michaelnikelsky1 · ‎01-25-2011

About the compile problems: There were issues with some very simple functions like

inline __m128 _mm_madd_ps(const __m128& a, const __m128& b, const __m128& c)
{
    return _mm_add_ps(_mm_mul_ps(a, b), c);
}

Looks pretty standard to me (and GCC thinks so as well). Maybe it was a compiler bug, but since at the moment there is no advantage for us in switching to a different compiler, it just is not an option.

About the amount of SSE code: Be assured that the amount of SSE code is exactly the amount we need, and leaving the work to the compiler would make things a lot slower (after all, we are competing with GPU-based tracers here and can beat them on a dual-CPU workstation, so it looks like we are not doing things too wrong).

Current compilers are just not capable of SIMDifying ~1000 lines of C++ code for a single material shader with multiple (virtual) function calls in it. There is just no way for them to recognize that they can compute 4 or more values in parallel.

Automatic SIMDifying works fine for small loops, but for complex algorithms it just fails, always.

levicki · ‎01-25-2011

Intel customers (and even people who are evaluating and considering Intel C++ Compiler) usually report all the problems they find either through the forum or through the premier support. Problems get fixed that way.

Same goes for auto-vectorization -- the more auto-vectorization problems people report, the smarter the compiler becomes.

Seems that in your case you decided to take an "easy" way out, and avoid participating in compiler enhancement, but at the cost of having to develop and maintain progressively larger and larger hand-written codebase.

In other words, your short-term win may become long-term loss as the size of your project keeps growing.

michaelnikelsky1 · ‎01-25-2011

>Seems that in your case you decided to take an "easy" way out, and avoid participating in compiler enhancement, but at the cost of having to develop and maintain progressively larger and larger hand-written codebase.

>In other words, your short-term win may become long-term loss as the size of your project keeps growing.

Needing to wait for a bug to get fixed instead of simply using the compiler that works for us is just a bad idea. Why should we bother using another compiler if what we use now works? For maybe 5% more performance at the cost of 5 times longer compile times? No, thank you. Why do you think you can judge what we are doing? We are developing bleeding-edge technology, running raytracing of multi-million-triangle scenes on clusters with up to 1000 cores and 8MP resolutions in realtime, faster than any competitor, and you want to tell me I am taking the easy way out because I prefer a solution that works now instead of having to wait for someone else to fix bugs in a compiler? Seriously, I have had enough trouble with GPU manufacturers and their driver bugs; I am happy if things just work like I tell them to and I can concentrate on algorithms.

I am saying you don't have a clue about what we are actually programming here, so you had better not start judging.

And who says handwritten SSE code is larger than handwritten standard C++ code? I think I can handle the required masks from time to time. And the code needs to be written anyway.