I am trying to "port" my Java special-functions class to pure x86 assembly. In my project I use the SSE and SSE2 instruction sets, operating on floating-point REAL4 values. I would like to use the movaps instruction because of timing (lower CPI than movups), but my program crashes with an "access violation" error. While debugging I found that the error is caused by a movaps instruction trying to access stack values local to the procedure (addressed by ebp-n); ebp is decremented by multiples of 16. When I use movups the problem is absent. I tried adding an align 16 directive, but it does not work, so I am stuck with the less efficient instruction.
Here is my code snippet, which calculates a few terms of the Taylor expansion of e^x.
[bash]
movaps xmm0,one            ;movaps works perfectly while accessing memory
addps  xmm0,argument       ;1+x xmm0 accumulator
mov    eax,OFFSET coef1
movaps xmm1,[eax]
rcpps  xmm2,xmm1           ;1/coef1
movaps xmm3,argument
mulps  xmm3,xmm3           ;x^2
movups [ebp-16],xmm3       ;store x^2 ;here movaps crashes program
mulps  xmm2,xmm3
addps  xmm0,xmm2           ;1+x+x^2/2! xmm0 accumulator
mov    eax,OFFSET coef2
movups xmm1,[eax]
rcpps  xmm2,xmm1           ;1/coef2
movups xmm7,argument
movups xmm3,[ebp-16]
mulps  xmm3,xmm7           ;x^3
movups [ebp-32],xmm3       ;store x^3
mulps  xmm2,xmm3
addps  xmm0,xmm2           ;1+x+x^2/2!+x^3/3! xmm0 accumulator
[/bash]
Short answer: you don't need to worry about MOVAPS vs. MOVUPS for loads and stores.
Long answer: you could make an effort to align your stack (e.g. by adding AND EBP, 0xfffffff0), but MOVUPS has been as fast as MOVAPS for 4 generations of Intel CPUs now. You are only really penalized when a load or store crosses a page boundary, which is a relatively rare case. Also, a store to the stack followed by a load from the same location is handled by a shortcut called store-to-load forwarding, without cache interaction. The performance bottlenecks in this code are almost certainly elsewhere.
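If you do want aligned stack slots anyway, here is a minimal sketch of one way to get them. A multiple-of-16 offset from ebp is only 16-byte aligned if ebp itself is, and 32-bit calling conventions typically guarantee just 4-byte stack alignment on entry, which would explain the fault at [ebp-16]. Note that this sketch masks esp rather than ebp, so that ebp can still be used to restore the stack in the epilogue; that is a swapped-in variant of the AND suggestion above, not the only way to do it. The procedure name aligned_scratch is hypothetical, and xmm3 is assumed to hold x on entry.

```asm
; Sketch only (MASM syntax, 32-bit). aligned_scratch is a hypothetical name.
; It declares no PROC parameters or LOCALs, so the assembler is assumed
; to add no prologue/epilogue of its own.
aligned_scratch PROC
    push    ebp
    mov     ebp, esp            ; remember the incoming esp for the epilogue
    and     esp, 0FFFFFFF0h     ; round esp down to a multiple of 16
    sub     esp, 32             ; reserve two 16-byte slots: [esp], [esp+16]

    movaps  xmm7, xmm3          ; keep a copy of x (xmm3 assumed to hold x)
    mulps   xmm3, xmm3          ; x^2
    movaps  [esp+16], xmm3      ; store x^2: address is 16-byte aligned
    mulps   xmm3, xmm7          ; x^3
    movaps  [esp], xmm3         ; store x^3: also aligned

    movaps  xmm1, [esp+16]      ; aligned reloads work the same way
    movaps  xmm2, [esp]

    mov     esp, ebp            ; drop the scratch area and alignment padding
    pop     ebp
    ret
aligned_scratch ENDP
```

The scratch slots are addressed relative to esp because, after the AND, the distance between ebp and the aligned area is no longer a known constant.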