Intel® ISA Extensions

Mixing SSE and AVX inside an application

michaelnikelsky1
Beginner
2,388 Views
Hi,

I am currently in the process of adding AVX support to my application.

While the floating-point AVX port looks quite simple, the integer AVX port is not, since there is no 256-bit integer AVX ( :( ). So I need to emulate those operations with two AVX-128 instructions. However, there seem to be no AVX-128 intrinsics (at least I couldn't find them). But since there is a big penalty for switching between SSE and AVX, I need the compiler to generate AVX-128 integer instructions.
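
To make the emulation idea concrete, here is a minimal sketch of the kind of thing I mean for a 256-bit integer add, split across the two 128-bit lanes (the helper name is just illustrative; it only avoids the transition penalty if the 128-bit integer ops end up VEX-encoded, i.e. the file is compiled with AVX enabled):

#include <immintrin.h>

// Hypothetical helper: emulate a 256-bit integer add on AVX1-only hardware
// by operating on the two 128-bit lanes separately.
static inline __m256i add_epi32_emulated(__m256i a, __m256i b)
{
    __m128i aLo = _mm256_castsi256_si128(a);       // lower 128 bits
    __m128i aHi = _mm256_extractf128_si256(a, 1);  // upper 128 bits
    __m128i bLo = _mm256_castsi256_si128(b);
    __m128i bHi = _mm256_extractf128_si256(b, 1);
    __m128i rLo = _mm_add_epi32(aLo, bLo);         // 128-bit integer adds
    __m128i rHi = _mm_add_epi32(aHi, bHi);
    return _mm256_insertf128_si256(_mm256_castsi128_si256(rLo), rHi, 1);
}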

I know about the AVX compiler flag, but it is out of the question since the SSE code needs to stay intact so I can still run the software on platforms that don't support AVX. So the idea is to have two code paths and a branch somewhere to choose the code path that fits the CPU.
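
The branch I have in mind would look roughly like this (a sketch with made-up function names; note that a complete AVX check also has to verify via XGETBV that the OS saves the YMM state, and _xgetbv needs VS2010 SP1):

#include <intrin.h>
#include <immintrin.h>

static bool cpuSupportsAvx()
{
    int regs[4];
    __cpuid(regs, 1);
    const bool osxsave = (regs[2] & (1 << 27)) != 0;   // OS uses XSAVE/XRSTOR
    const bool avx     = (regs[2] & (1 << 28)) != 0;   // CPU reports AVX
    if (!osxsave || !avx)
        return false;
    // XCR0 bits 1 and 2: OS saves XMM and YMM state on context switches.
    const unsigned long long xcr0 = _xgetbv(0);
    return (xcr0 & 0x6) == 0x6;
}

void traceImageSse();  // built without /arch:AVX
void traceImageAvx();  // built with /arch:AVX (separate translation unit or DLL)

void traceImage()
{
    if (cpuSupportsAvx())
        traceImageAvx();
    else
        traceImageSse();
}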

So what am I supposed to do to get the compiler to generate AVX-128 instructions in one place and SSE instructions in another for the same source file? And why aren't there any AVX-128 intrinsics?

By the way, I am using VC2010 at the moment; switching to the Intel compiler would be a lot of work (I tried it and there were quite a few cases where it couldn't compile the code, so that pretty much rules itself out as well, although it might be a last resort).

Any hint would be great.
Michael
0 Kudos
36 Replies
levicki
Valued Contributor I
428 Views
I am not trying to judge you, so you can hold your horses.

I put the word "easy" in quotes because I don't think either path is really easy. I am just stating some obvious facts.

Intel has an excellent track record of fixing bugs that are reported, sometimes even providing a specific fix over FTP to a customer, and there is always a workaround in the meantime. I sincerely doubt that you would get such treatment with GNU or MSVC if you hit a bug there.

Regarding SSE vs. C/C++ code size: one intrinsic usually maps to one instruction, while one line of C/C++ code can map to several instructions. I was saying that SSE code is generally larger in terms of number of lines of code, not in terms of instruction count.

Regarding the possible advantage, it may be more than 5% when you factor in the global optimizations and complex code transformations the Intel compiler is capable of, not to mention the readily available performance libraries, but it looks like you will never find out how much advantage it might bring you.

I have worked on 3D image reconstruction (back projection, forward projection) for medical purposes, both on the CPU and on a GPU, but you are right, I really don't have a clue... I wonder why you came here to ask people who don't have a clue for help and advice when you already know what is best for you. The MSDN/TechNet forum might be a better place for your arrogant attitude.
0 Kudos
bronxzv
New Contributor II
428 Views
Igor, this is the AVX forum, not the Intel C++ forum. Why don't you simply let the people doing actual AVX development these days talk freely together? You have no moderator credentials here (please correct me if I'm wrong) to say who can post, what to post, or who should go away because they use different methodologies than you.
0 Kudos
michaelnikelsky1
Beginner
428 Views

>Regarding the possible advantage, it may be more than 5% when you factor in the global optimizations and complex code transformations the Intel compiler is capable of, not to mention the readily available performance libraries, but it looks like you will never find out how much advantage it might bring you.

We did test it about a year and a half ago, and it gave roughly 5 percent more performance while taking a lot longer to compile. MSVC is not as bad anymore as it was back in the VC6 and VC7 versions; it can do global optimizations and profile-guided optimizations as well. There is still a performance advantage with the Intel compiler, but it is not worth the additional effort we would need to put into switching compilers. And I didn't come here for any compiler discussions. Last time I checked, this was an AVX forum, not an Intel compiler forum.

>I have worked on 3D image reconstruction (back projection, forward projection) for medical purposes, both on the CPU and on a GPU, but you are right, I really don't have a clue... I wonder why you came here to ask people who don't have a clue for help and advice when you already know what is best for you. The MSDN/TechNet forum might be a better place for your arrogant attitude.

Yes, you don't have a clue about what we are doing here. I didn't say you don't have a clue about programming or your own work. But you definitely have no clue about why we chose to use SSE everywhere inside the raytracer (as I already said: because it is faster, and automatic optimization fails for our tasks and will probably always fail unless it implements a dynamic scheduler that can figure out which rays need to be traced and which are already terminated).

And about the arrogant attitude: I don't like being told the problem is my code (which compiles fine with MSVC AND GCC), and being told that I am doing stupid things because I have a 100% SSE-optimized raytracer instead of just a few SSE functions. So which one of us is more arrogant?

I came here hoping to find a solution to a problem (missing AVX-128 intrinsics). The solutions I already knew about (and that were presented here) are all flawed for various reasons in my opinion, so I hoped to find an alternative. But since there didn't seem to be one, I adjusted by restructuring my code a little bit and putting everything into a DLL I can compile with different flags. Not the solution I had hoped for, but it seems to work; problem solved.
0 Kudos
levicki
Valued Contributor I
428 Views
@bronxzv:

You are right, I do not have admin rights, nor would I like to have that burden on my shoulders.

However, if you re-read my posts in this thread, you will see that I was trying to be helpful as usual, and in return I was told twice that I am incompetent / don't have a clue, etc.

Furthermore, Michael was very brash and unpleasant in his replies from the beginning (even towards you), and I feel that I have the right as a senior member of the ISN community (if not as a Black Belt) to say something about it, especially when his replies have offended me.

@Michael:

I wrote "in most cases, the problem is with the code", not "the problem is in your code". For me, there is a considerable difference between the two, even though English is not my primary language.

0 Kudos
bronxzv
New Contributor II
428 Views
>Furthermore, Michael was very brash and unpleasant in his replies from the beginning (even towards you),

Sure, you're right, though I think we are now one step further, and I'll be very interested to hear about the speedups Michael gets with his raytracer. If he posts his findings on some MS forums I won't be aware of it, since I don't visit those very often.
0 Kudos
michaelnikelsky1
Beginner
428 Views
I am sorry to you both; I didn't mean to be rude at all. I just couldn't see the relevance of the answers to my question. I was pretty clear about the options I knew of in my very first post, and getting the same options I already knew about and had discarded for good reasons presented as answers is a bit disappointing and frustrating.

@Igor: Once again, I did not say you are incompetent, in any way. You just don't know our source code and can't judge our decisions. And I personally felt offended by something like

>Writing almost everything with SSE manually doesn't make sense when a better compiler can do that for you automatically.

So you were essentially saying that what we are doing doesn't make any sense, although you are not in any position to give a qualified judgement. And being called arrogant for saying "you have no clue" (because you don't know our code and therefore your solutions were impossible to apply to our problem) offended me quite a bit.
0 Kudos
bronxzv
New Contributor II
428 Views

>I just couldn't see the relevance of the answers to my question

The title of your first post read "Mixing SSE and AVX inside an application", and my advice to compile the same files with different options and to use C++ namespaces to avoid identifier collisions at link time is arguably at least somewhat relevant. IMHO it's even the best way to deal with the issue, and it can be used with all C++ compilers. By the way, the fact that there are no VEX-128 intrinsics (fp or int) is an orthogonal issue, since the collision at link time comes from compiling the same file several times and won't be avoided by a 128-bit variant of all the intrinsics (not without duplicating all your source code). For example, in our case one way to deliver the renderer is through a freeware web player (NPAPI plugin or ActiveX):

http://www.inartis.com/Products/Kribi%203D%20Player/Default.aspx

and it's arguably better to have the whole application in a single .dll or .ocx file instead of several DLLs, which would be more difficult to install and update.
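
To illustrate the namespace idea, it goes something like this (file and identifier names are made up for the example):

// kernel_body.h -- deliberately no include guard; included once per target path
namespace KERNEL_NS
{
    void accumulate(float* r, const float* a, const float* b, unsigned n)
    {
        for (unsigned i = 0; i < n; ++i)   // vectorized according to the /arch switch of the TU
            r[i] += a[i] * b[i];
    }
}

// path_sse.cpp -- compiled without /arch:AVX
#define KERNEL_NS kernel_sse
#include "kernel_body.h"

// path_avx.cpp -- compiled with /arch:AVX
#define KERNEL_NS kernel_avx
#include "kernel_body.h"

// dispatcher: call kernel_avx::accumulate(...) or kernel_sse::accumulate(...)
// depending on a CPUID check done once at startup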

0 Kudos
michaelnikelsky1
Beginner
428 Views
Yes, of course that was relevant, and that wasn't one of the answers I was referring to. For us, using a single DLL for the whole application wouldn't work (or it would be a 200 MB DLL); the raytracing (SSE) part is only a small part of the whole application.

About the VEX-128 intrinsics and why I really miss them:

My idea was the following: every function has a template parameter specifying a version ID (or whatever). So you implement the function only once and let the compiler instantiate it as often as needed. The beauty of this approach is that you only need to write a function once for a templated base type and let the compiler instantiate it for 1-, 4-, 8-, 16-, ... component vectors. Using a few special functions you can then decide at runtime which instantiation to call, based on the active rays in a packet, always choosing the optimal code path.

Extending this approach to support AVX would then just be a matter of implementing the base types for AVX as well and adding an entry point that calls the image trace function with the required template parameter (which would just be an int). There would be no need to reimplement any functions of the main raytracer. But this doesn't work, since VEX-128 intrinsics and integer VEX-256 are missing. So using one DLL for AVX and one for SSE is OK, but it makes the build process a bit more complicated.
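
In outline, the idea would look something like this (just a sketch with placeholder names, not our actual code):

#include <immintrin.h>

template <int Width> struct FloatVec;              // primary template, one specialization per width

template <> struct FloatVec<4> { __m128 v; /* SSE operators, loads, stores, ... */ };
template <> struct FloatVec<8> { __m256 v; /* AVX operators, loads, stores, ... */ };

template <int Width>
void traceImage(/* scene, framebuffer, ... */)
{
    typedef FloatVec<Width> F;
    // the whole raytracer written once against F
}

// entry points chosen at runtime:
//   traceImage<8>(...) on AVX hardware, traceImage<4>(...) otherwise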

Anyway, I really didn't want to offend anyone, so sorry again if I did.
0 Kudos
bronxzv
New Contributor II
428 Views


>The beauty of this approach is that you only need to write a function once for a templated base type

We do just that, though without using templates; we simply include special headers for each target path.
Here is a very simple example of source code:

const PFloat oSpotR(lightColor.x),oSpotG(lightColor.y),oSpotB(lightColor.z);
const ULONG count = lBunch.samplesCount;

#pragma unroll(4)
for (ULONG i = 0; i < count; i++) {
const PFloat ok = PFloat(qFk+i) * PFloat(qAs+i);
MAcc(r+i,PFloat(lBunch.fColR+i),oSpotR*ok);
MAcc(g+i,PFloat(lBunch.fColG+i),oSpotG*ok);
MAcc(b+i,PFloat(lBunch.fColB+i),oSpotB*ok);
}

Headers for the SSE path define FP_VEC_WIDTH = 4; PFloat (i.e. packed float) is a wrapper class around __m128 where operator* is based on _mm_mul_ps, etc.

For AVX, FP_VEC_WIDTH = 8 and PFloat is a wrapper class around __m256 where operator* is based on _mm256_mul_ps, etc.
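
For illustration, a stripped-down AVX-path PFloat could look roughly like this (only a sketch; the real class and the MAcc helper have more to them, and the exact signatures here are assumptions):

#include <immintrin.h>

class PFloat {
public:
    PFloat() {}
    explicit PFloat(float s)        : v(_mm256_set1_ps(s))  {}  // broadcast a scalar
    explicit PFloat(const float* p) : v(_mm256_loadu_ps(p)) {}  // load 8 floats
    explicit PFloat(__m256 x)       : v(x) {}
    friend PFloat operator*(const PFloat& a, const PFloat& b)
    {
        return PFloat(_mm256_mul_ps(a.v, b.v));
    }
    __m256 v;
};

// Multiply-accumulate helper as used in the loop above (assumed semantics:
// dst += a * b, stored back to memory, matching the vmulps/vaddps/vmovups
// pattern in the ASM dump below).
inline void MAcc(float* dst, const PFloat& a, const PFloat& b)
{
    const __m256 d = _mm256_loadu_ps(dst);
    _mm256_storeu_ps(dst, _mm256_add_ps(d, _mm256_mul_ps(a.v, b.v)));
}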

FYI, the ASM dump for the AVX path (without FMA3) is shown below exactly as generated by the compiler; it's unrolled for best performance, based on actual timings on a 2600K-based PC.


; LOE eax esi edi ymm4 ymm5 ymm6
.B11.82: ; Preds .B11.82 .B11.81

;;; {
;;; const PFloat ok = PFloat(qFk+i) * PFloat(qAs+i);

mov ecx, DWORD PTR [16+ebx] ;449.45
vmovups ymm7, YMMWORD PTR [-888+ebp+edi] ;449.38
vmulps ymm3, ymm7, YMMWORD PTR [edi+ecx] ;449.45

;;; MAcc(r+i,PFloat(lBunch.fColR+i),oSpotR*ok);

vmulps ymm0, ymm6, ymm3 ;450.48

;;; MAcc(g+i,PFloat(lBunch.fColG+i),oSpotG*ok);

vmulps ymm7, ymm5, ymm3 ;451.48

;;; MAcc(b+i,PFloat(lBunch.fColB+i),oSpotB*ok);

vmulps ymm3, ymm4, ymm3 ;452.48
mov DWORD PTR [-904+ebp], eax ;
mov eax, DWORD PTR [160+esi] ;450.26
mov edx, DWORD PTR [28+ebx] ;450.7
vmulps ymm1, ymm0, YMMWORD PTR [eax+edi] ;450.7
vaddps ymm2, ymm1, YMMWORD PTR [edi+edx] ;450.7
vmovups YMMWORD PTR [edi+edx], ymm2 ;450.7
vmovups ymm2, YMMWORD PTR [-856+ebp+edi] ;449.38
mov eax, DWORD PTR [164+esi] ;451.26
vmulps ymm0, ymm7, YMMWORD PTR [eax+edi] ;451.7
mov eax, DWORD PTR [32+ebx] ;451.7
vaddps ymm1, ymm0, YMMWORD PTR [edi+eax] ;451.7
vmovups YMMWORD PTR [edi+eax], ymm1 ;451.7
mov eax, DWORD PTR [168+esi] ;452.26
vmulps ymm0, ymm3, YMMWORD PTR [eax+edi] ;452.7
mov eax, DWORD PTR [36+ebx] ;452.7
vaddps ymm1, ymm0, YMMWORD PTR [edi+eax] ;452.7
vmovups YMMWORD PTR [edi+eax], ymm1 ;452.7
vmulps ymm2, ymm2, YMMWORD PTR [32+edi+ecx] ;449.45
vmulps ymm3, ymm6, ymm2 ;450.48
vmulps ymm1, ymm5, ymm2 ;451.48
vmulps ymm2, ymm4, ymm2 ;452.48
mov ecx, DWORD PTR [160+esi] ;450.26
vmulps ymm7, ymm3, YMMWORD PTR [32+ecx+edi] ;450.7
vaddps ymm0, ymm7, YMMWORD PTR [32+edi+edx] ;450.7
vmovups YMMWORD PTR [32+edi+edx], ymm0 ;450.7
mov ecx, DWORD PTR [164+esi] ;451.26
vmulps ymm3, ymm1, YMMWORD PTR [32+ecx+edi] ;451.7
mov ecx, DWORD PTR [32+ebx] ;451.7
vaddps ymm7, ymm3, YMMWORD PTR [32+edi+ecx] ;451.7
vmovups YMMWORD PTR [32+edi+ecx], ymm7 ;451.7
mov ecx, DWORD PTR [168+esi] ;452.26
vmulps ymm0, ymm2, YMMWORD PTR [32+ecx+edi] ;452.7
vmovups ymm2, YMMWORD PTR [-824+ebp+edi] ;449.38
vaddps ymm1, ymm0, YMMWORD PTR [32+edi+eax] ;452.7
vmovups YMMWORD PTR [32+edi+eax], ymm1 ;452.7
mov ecx, DWORD PTR [16+ebx] ;449.45
vmulps ymm1, ymm2, YMMWORD PTR [64+edi+ecx] ;449.45
vmulps ymm3, ymm6, ymm1 ;450.48
vmulps ymm2, ymm5, ymm1 ;451.48
vmulps ymm1, ymm4, ymm1 ;452.48
mov ecx, DWORD PTR [160+esi] ;450.26
vmulps ymm7, ymm3, YMMWORD PTR [64+ecx+edi] ;450.7
vaddps ymm0, ymm7, YMMWORD PTR [64+edi+edx] ;450.7
vmovups YMMWORD PTR [64+edi+edx], ymm0 ;450.7
mov ecx, DWORD PTR [164+esi] ;451.26
vmulps ymm3, ymm2, YMMWORD PTR [64+ecx+edi] ;451.7
vmovups ymm2, YMMWORD PTR [-792+ebp+edi] ;449.38
mov ecx, DWORD PTR [32+ebx] ;451.7
vaddps ymm7, ymm3, YMMWORD PTR [64+edi+ecx] ;451.7
vmovups YMMWORD PTR [64+edi+ecx], ymm7 ;451.7
mov ecx, DWORD PTR [168+esi] ;452.26
vmulps ymm0, ymm1, YMMWORD PTR [64+ecx+edi] ;452.7
vaddps ymm1, ymm0, YMMWORD PTR [64+edi+eax] ;452.7
vmovups YMMWORD PTR [64+edi+eax], ymm1 ;452.7
mov ecx, DWORD PTR [16+ebx] ;449.45
vmulps ymm0, ymm2, YMMWORD PTR [96+edi+ecx] ;449.45
vmulps ymm3, ymm6, ymm0 ;450.48
vmulps ymm2, ymm5, ymm0 ;451.48
vmulps ymm0, ymm4, ymm0 ;452.48
mov ecx, DWORD PTR [160+esi] ;450.26
vmulps ymm7, ymm3, YMMWORD PTR [96+ecx+edi] ;450.7
vaddps ymm1, ymm7, YMMWORD PTR [96+edi+edx] ;450.7
vmovups YMMWORD PTR [96+edi+edx], ymm1 ;450.7
mov edx, DWORD PTR [164+esi] ;451.26
vmulps ymm3, ymm2, YMMWORD PTR [96+edx+edi] ;451.7
mov edx, DWORD PTR [32+ebx] ;451.7
vaddps ymm7, ymm3, YMMWORD PTR [96+edi+edx] ;451.7
vmovups YMMWORD PTR [96+edi+edx], ymm7 ;451.7
mov ecx, DWORD PTR [168+esi] ;452.26
vmulps ymm0, ymm0, YMMWORD PTR [96+ecx+edi] ;452.7
vaddps ymm1, ymm0, YMMWORD PTR [96+edi+eax] ;452.7
vmovups YMMWORD PTR [96+edi+eax], ymm1 ;452.7
add edi, 128 ;447.5
mov eax, DWORD PTR [-904+ebp] ;447.5
inc eax ;447.5
cmp eax, DWORD PTR [-916+ebp] ;447.5
jb .B11.82 ; Prob 27% ;447.5

0 Kudos
michaelnikelsky1
Beginner
428 Views
I am not sure I understand exactly what you are doing. For example, how would you implement something like this:

An AVX ray packet with 8 rays in it hits 2 surfaces, one with material A, the other with material B. Now let's assume one ray in the packet hits material A and 7 hit material B. Calling material B's shade function for the whole ray packet, masking out the one ray that hits something else, is OK. But calling material A's shade function with 7 rays masked out is just burning CPU cycles. So ideally you would like to call a shade function that is specialized for handling a single ray. With templates you could have something like

template<int Width>
class FloatType
{...};

template<>
class FloatType<1>
{....};

template<>
class FloatType<4>
{...};

and a shade function like

template<int Width>
void shade()
{
typedef FloatType<Width> floatType;
...
do work with floatType
}


The calling code would be something like this (ray extraction code omitted):

switch( _mm_movemask_ps( activeRays))
{
case 1:
shade<1>();
break;
case 4:
shade<4>();
}

I am not sure how you would do this with includes.
0 Kudos
bronxzv
New Contributor II
428 Views
The only way to get good speedups with SSE/AVX-256 is to use all 4/8 computation slots whenever possible, so calling a scalar path like you would for material A is out of the question (btw, scalar SSE vs. packed SSE has the same throughput anyway). What you have to do is aggregate the rays for each material in a first pass (using an operation like VCOMPRESS in LRBni), then process each material in turn with as many samples/rays packed together as possible. This way you also maximize the temporal coherence of your memory accesses (for example for texture fetches).
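
In pseudo-C++ the first pass amounts to something like this (a minimal scalar sketch of the re-bunching idea, not code from our renderer):

#include <vector>

// Collect, per material, the indices of the rays that are still active, so
// that full 4/8-wide packets can be rebuilt from each bucket before shading.
void bucketRaysByMaterial(const int* materialId, const bool* rayActive,
                          int numRays, int numMaterials,
                          std::vector<std::vector<int> >& raysPerMaterial)
{
    raysPerMaterial.assign(numMaterials, std::vector<int>());
    for (int i = 0; i < numRays; ++i)
        if (rayActive[i])
            raysPerMaterial[materialId[i]].push_back(i);
    // Each bucket is then consumed 4 (SSE) or 8 (AVX) rays at a time, with a
    // mask only needed for the final, possibly partial, packet.
}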

Also note that your example with a "switch" statement will suffer heavy branch prediction misses; this will be one more limiter on the scalability from SSE to AVX-256.
0 Kudos
michaelnikelsky1
Beginner
428 Views
Of course you are right, but it is not always possible to keep the slots full, especially at high ray depths. Then again, this is a limitation of the current design of our raytracing loop; I will probably clean that up so these things are easier to do.

The switch statement is indeed not optimal (and actually this is more a theoretical discussion, since I haven't really implemented it this way), but since in our case we are calling a virtual function that executes many thousands of instructions, it should not be the limiting factor. Memory access is still what makes the largest difference.
0 Kudos
bronxzv
New Contributor II
428 Views

In some of my kernels, branch prediction misses are a strong limiter on SSE-to-AVX scalability. Branches are unavoidable in optimized code, even in some vectorized loops (in some of my cases at least); i.e. the timings are worse with 100% branch elimination than with 90+% branch elimination plus a few remaining hard-to-predict branches.

It went a bit like this in today's experiment (best performance = 100):

variant 1: branch elimination + a few branches
SSE 87
AVX-256 100
SSE to AVX-256 speedup = 15%

variant 2: 100% branch elimination
SSE 72
AVX-256 95
SSE to AVX-256 speedup = 32%

So the variant with the best scalability would be fine for putting AVX in a good light (even if 32% for perfectly vectorized code is disappointing), but I had to choose the first variant, with the 15% speedup.
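In code, the difference between the two variants boils down to something like this toy sketch (not my actual kernel; 'lit'/'unlit' are placeholder values):

#include <immintrin.h>

// variant 2: 100% branch elimination -- always compute both sides, then blend by the mask
static inline __m256 shadeBlend(__m256 lit, __m256 unlit, __m256 mask)
{
    return _mm256_blendv_ps(unlit, lit, mask);
}

// variant 1: mostly branchless, but keep one branch for fully masked-out packets
// (in a real kernel the early-out lets the expensive computation of 'lit' be
// skipped entirely; here it just returns early)
static inline __m256 shadeBranch(__m256 lit, __m256 unlit, __m256 mask)
{
    if (_mm256_movemask_ps(mask) == 0)   // whole packet inactive
        return unlit;
    return _mm256_blendv_ps(unlit, lit, mask);
}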



0 Kudos
levicki
Valued Contributor I
428 Views
@michael:

When I said that, I incorrectly assumed that you might be over-optimizing by writing everything manually with SSE intrinsics -- I apologize if that offended you, but I have seen people do that in the past, so I wanted to rule it out as an option. Unfortunately, it came across clumsily.

Keep in mind that I was not aware of the ratio of serial vs. parallel code in your application, nor that you are writing shading functions on the CPU, simply because you did not tell us exactly what you are doing.

Without more details about your project, the only option we had was to guess what an appropriate solution for you might be.

To summarize -- before I offended you with my assumption I suggested the following:

1. Using three-operand syntax where you are already writing AVX code
2. Checking the optimization reference manual on mixing SSE and AVX code
3. Writing two versions of the critical functions and using the CPU dispatching feature of the Intel Compiler (see the sketch after this list)
4. Re-evaluating the Intel Compiler as an option
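
For reference, the manual CPU dispatching feature of the Intel Compiler looks roughly like this, if I recall the documentation correctly (the function name and bodies are only illustrative):

// empty stub that declares which targets exist
__declspec(cpu_dispatch(core_2nd_gen_avx, generic))
void shadePacket(float* dst, const float* src, int n) {}

// AVX version, picked on 2nd-generation Core (Sandy Bridge) and later
__declspec(cpu_specific(core_2nd_gen_avx))
void shadePacket(float* dst, const float* src, int n)
{
    // 256-bit AVX implementation
}

// fallback version for every other CPU
__declspec(cpu_specific(generic))
void shadePacket(float* dst, const float* src, int n)
{
    // SSE or scalar implementation
}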

Out of those 4 suggestions only suggestion #3 cannot be applied, and only because in your code all functions seem to be critical.

In my opinion, all those suggestions were quite reasonable given the circumstances, and definitely not the reason to get so worked up.

I will now go and stand in my corner for a while.

0 Kudos
michaelnikelsky1
Beginner
428 Views
You are right, I should have been more specific.

I am using intrinsics only, so I hope the compiler will figure out when it can use the three-operand syntax if I set the AVX compiler flag.

The problem with #3 for us is that the whole raytracer is designed to work on floating-point/integer data with a specified packet width, so every function simply assumes it can work on multiple values in parallel. Terminated rays are masked out and updates are done accordingly when necessary. Therefore dispatching would mean dispatching the whole raytracer, which has about 300 source files at the moment.

#4 is really something I would like to try, but the last time we tried, I just gave up after 5 days because we couldn't get everything to compile and weren't even able to figure out what exactly the problem was. Maybe we will give it another try at some point in the future, but for the moment we will use VC2010 and GCC only.

0 Kudos
levicki
Valued Contributor I
428 Views
I am glad that we now understand each other.

If you decide to re-evaluate the Intel C++ Compiler and you still have issues, do not forget that in addition to Premier Support, where you can submit your issues, there is a compiler forum here as well, with a lot of experts always willing to help.
0 Kudos
Reply