Solved: 3D engine - Page 6

Anonymous · ‎09-02-2014

Hello there,

OK, here we go: my dream is to make a 3D engine written 100% in Intel assembly, running on the CPU only. For now I only use rotation matrices.

It works, of course, but it gets slow when I draw a lot of pixels.

Recently I decided to add voxels to my engine, and it is slow once I reach >= 8000 voxels (a 20 * 20 * 20 cube). When I saw that NVIDIA can display 32M voxels (fire), I wondered how they do it!

I have a rough idea of the reason: MMU, paging, segmentation, memory.

Am I right?

Another question: is the FPU always slower than SSE for floating-point math, or does it depend on the data being manipulated?

PS: I work without an OS like Windows or Linux; I run on my own kernel and bootloader, also written in assembly with NASM.

Sorry if my English is not good, I'm French and I use Google Translate ^-^

Bradley_W_Intel · ‎09-02-2014

You clearly are using the processor in a very advanced way. I will do my best to answer your questions:

1) Why is your voxel engine not able to efficiently render as many voxels as you'd like? Voxel engines need to maximize their use of parallelism (both threading and SIMD) and also to store the data efficiently in an octree or some other structure that can handle sparse data. If you are doing all these things and still not getting the performance you expect, it's an optimization problem. Some Intel tools like VTune Performance Analyzer are excellent for performance analysis.

2) Is single data floating point math faster than SIMD (if I understood you)? Typically SIMD will be faster than single data instructions if your data is laid out in a way that supports the SIMD calls. In all cases, the only way for you to know for certain which way is faster is to test it.

3) How can you select between discrete and processor graphics? DirectX has methods of enumerating adapters. In such a case, the processor graphics is listed separately from the discrete graphics. If you are choosing your adapter based on the amount of available memory, you may be favoring the processor graphics when you didn't intend to. Intel has sample code that shows how to properly detect adapters in DirectX at https://software.intel.com/en-us/vcsource/samples/gpu-detect. The process for OpenGL is not well documented.

4) Can I use one processor to control execution of a second processor? Probably not. The details on Intel processors are covered at http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html. It's possible, though unlikely, that you'll be able to find something in there that can help you.

Bernard · ‎09-26-2014

shaynox s. wrote:

I am trying inline asm and I get the error "Unknown opcode DD in asm instruction" with:
void __declspec(naked) make_rotation()
{
	// Naked functions must provide their own prolog...
	__asm{
				translation:		dd  0
                               ...
		  }
}

IIRC, inside an inline assembly block you cannot use MASM (or other assembler) data-declaration directives such as DD. Variables must be declared in C before the __asm block.

http://msdn.microsoft.com/en-us/library/5sds75we.aspx

http://msdn.microsoft.com/en-us/library/4ks26t93.aspx
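A minimal sketch of that advice: the data is declared as an ordinary C array and the `__asm` block only refers to it by name. The function name `load_translation_x` and the array contents are made up for illustration, and the inline-assembly path is only compiled for 32-bit MSVC-style compilers (an assumption about the toolchain); every other compiler takes the plain C path.

```c
#include <assert.h>

/* Data lives in C, outside the __asm block (illustrative values). */
static float translation[4] = { 0.0f, 1.5f, 0.0f, 0.0f };

/* Returns translation[1]. */
static float load_translation_x(void)
{
    float result;
#if defined(_MSC_VER) && defined(_M_IX86)
    __asm {
        movss xmm0, dword ptr [translation + 4]   ; byte offset 4 = element 1
        movss dword ptr [result], xmm0
    }
#else
    result = translation[1];   /* portable fallback */
#endif
    assert(result == translation[1]);
    return result;
}
```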

Bernard · ‎09-26-2014

>>>OK, and for data declarations I don't use C declarations, because icl doesn't align the data; unfortunately it places it randomly.>>>

It depends on the data type. Usually, when working with arrays, the compiler lays the elements out contiguously.

Bernard · ‎09-26-2014

Please download the AVX cloth simulation source and also read the included PDF document about the SoA optimization.

https://software.intel.com/en-us/articles/soa-cloth-simulation-with-256-bit-intel-advanced-vector-extensions-intel-avx

Anonymous · ‎09-26-2014

I'm back. Mysteriously, if I use inline asm, I drop to 70 fps (250 originally).

	static float translation[5] = { 0,
									0,		// translation_x
									0,		//translation_y
									0,		// translation_z
									0		// reserved: color
								  };
	float angle = 0;
	float rotation_x = 0;
	static float _xmm0[6] = { 0,			// sin.x	0
							  1.0,			// cos.x	4
							  1.0,			// cos.x	8
							  0,			// sin.x	12
							  1.0,			// 1		16	
							  0			// sin.y	20				
							 };
	float rotation_z = 0;
	static float _xmm1[6] = { 0,			// sin.z	0
							  1.0,			// cos.z 4
							  1.0,			// cos.z	8
							  0,			// sin.z	12	
							  1.0,			// 1		16
							  0,			// sin.y	20					
							 };
	float rotation_y = 0;
	static float _xmm2[6] = { 0, 			// sin.y	0
							  1.0,			// cos.y 4
							  1.0,			// cos.y 8
							  0,			// sin.y	12	
							  1.0,			// 1		16
							  0			// sin.y	20
							 };
	float color_pixel = 0;
	float coordonee[5] = { x,		// x
						   y,		// y
						   z,		// z
						   0,		// reserved: color
						 };
	float end_coord[3] = { 0,		// _x
						   0,		// _y
						   0		// _z
						 };
	float conv_signe = -0.0;
	float rapport = RAPPORT;
	
	// x
	_xmm0[0] = sin(DEG2RAD(rotation_object[0]));
	_xmm0[3] = sin(DEG2RAD(rotation_object[0]));
	_xmm0[5] = sin(DEG2RAD(rotation_object[0]));

	_xmm0[1] = cos(DEG2RAD(rotation_object[0]));
	_xmm0[2] = cos(DEG2RAD(rotation_object[0]));

	// z
	_xmm1[0] = sin(DEG2RAD(rotation_object[2]));
	_xmm1[3] = sin(DEG2RAD(rotation_object[2]));
	_xmm1[5] = sin(DEG2RAD(rotation_object[2]));

	_xmm1[1] = cos(DEG2RAD(rotation_object[2]));
	_xmm1[2] = cos(DEG2RAD(rotation_object[2]));

	// y
	_xmm2[0] = sin(DEG2RAD(rotation_object[1]));
	_xmm2[3] = sin(DEG2RAD(rotation_object[1]));
	_xmm2[5] = sin(DEG2RAD(rotation_object[1]));

	_xmm2[1] = cos(DEG2RAD(rotation_object[1]));
	_xmm2[2] = cos(DEG2RAD(rotation_object[1]));
	
	__asm{
		// == == == == == == =
		// yaw
		// == == == == == == =
	Yaw:        // y
		// Apply the rotation to the point | [esi + 0] = x
		// | [esi + 4] = y
		// | [esi + 8] = z
		// Compute x = x.cos(phi.y) * cos(phi.z) - y.cos(phi.y) * sin(phi.z) - z.sin(phi.y)
		//
		// Compute A = x.cos(phi.y), B = y.cos(phi.y) and C = z.sin(phi.y)
		movups	xmm0, [_xmm2 + 4]
			movups	xmm1, [coordonee]
			mulps	xmm0, xmm1

			// Compute D = A * cos(phi.z), E = B * sin(phi.z) and C = C * 1
			movups	xmm1, [_xmm1 + 8]
			mulps	xmm0, xmm1

			// Compute F = D - E, C = C - 0
			hsubps	xmm0, xmm0

			// Compute xmm0 = F - C
			hsubps	xmm0, xmm0

			// Scale x by the x/y ratio so that x stays proportional to y
			movd	xmm1, [rapport]
			divps	xmm0, xmm1

			// Save the new coordinate
			movd	[end_coord], xmm0

			// == == == == == == =
			// / yaw
			// == == == == == == =

			// == == == == == == =
			// pitch
			// == == == == == == =
		Pitch:        // x
		// Apply the rotation to the point | [esi + 0] = x
		// | [esi + 4] = y
		// | [esi + 8] = z
		// Compute y = x.(cos(phi.x) * sin(phi.z) - sin(phi.x) * cos(phi.z) * sin(phi.y)) +
		//				 y.(sin(phi.x) * sin(phi.z) * sin(phi.y) + cos(phi.x) * cos(phi.z)) -
		//				 z.(sin(phi.x) * cos(phi.y))
		//
		// Compute A = cos(phi.x) * sin(phi.z), B = sin(phi.x) * cos(phi.z), E = cos(phi.x) * cos(phi.z) and F = sin(phi.x) * sin(phi.z)
		movddup xmm0, [_xmm0 + 8]
			movups 	xmm1, [_xmm1]
			mulps	xmm0, xmm1

			// Save xmm0 in xmm7 so it can be reused as xmm0 in Roll, because the equation for y resembles the equation for z
			// except that sin(phi.y) is multiplied by different terms

			// Compute C' = A' * sin(phi.y) and G' = E' * sin(phi.y)
			movddup	xmm7, [_xmm2 + 12]
			mulps	xmm7, xmm0

			// Compute C = B * sin(phi.y) and G = F * sin(phi.y)
			movddup	xmm2, [_xmm2 + 16]
			mulps	xmm0, xmm2

			// Copy the high half (bits 64..127) of a packed single-precision value (4 * 32 bits) into the low half (bits 0..63).
			// In short, we split the x and y parts:     xmm0 = A) cos(phi.x) * sin(phi.z)								xmm0 = cos(phi.x) * sin(phi.z)
			//											 			C) sin(phi.x) * cos(phi.z) * sin(phi.y) = > sin(phi.x) * sin(phi.y) * cos(phi.z)
			//														E) cos(phi.x) * cos(phi.z)								xmm1 = cos(phi.x) * cos(phi.z)
			//														G) sin(phi.x) * sin(phi.z) * sin(phi.y)							sin(phi.x) * sin(phi.y) * sin(phi.z)
			movhlps xmm1, xmm0

			// Compute D = A - C
			hsubps xmm0, xmm0

			// Compute H = E + G
			haddps xmm1, xmm1

			// Compute sin(phi.x) * cos(phi.y) and cos(phi.x) * cos(phi.y)
			//
			// Compute I.roll = cos(phi.x) * cos(phi.y) and I.Pitch = sin(phi.x) * cos(phi.y)
			movlps		xmm3, [_xmm0 + 8]
			movlps		xmm2, [_xmm2 + 4]
			mulps		xmm2, xmm3
			movshdup 	xmm3, xmm2
			// Compute x.D + y.H - z.I
			//
			// Compute J = x.D, K = y.H and L = z.I
			movups		xmm5, [coordonee]
			movsldup	xmm4, xmm1        // y.H
			movss		xmm4, xmm0        // x.D
			movlhps 	xmm4, xmm3        // z.I.Pitch
			mulps		xmm4, xmm5

			// Compute M = J + K
			haddps	xmm4, xmm4

			// Compute N = M - L
			hsubps	xmm4, xmm4

			// Save the new coordinate
			movd	[end_coord+4], xmm4

			// == == == == == == =
			// / pitch
			// == == == == == == =
			// == == == == == == =
			// roll
			// == == == == == == =
		Roll:        // z
		// Apply the rotation to the point | [esi + 0] = x
		// | [esi + 4] = y
		// | [esi + 8] = z
		// Compute z' = x.(cos(phi.x) * cos(phi.z) * sin(phi.y) + sin(phi.x) * sin(phi.z)) +
		//				  y.(sin(phi.x) * cos(phi.z) - cos(phi.x) * sin(phi.z) * sin(phi.y)) +
		//				  z.(cos(phi.x) * cos(phi.y))
		//
		// Copy the high half (bits 64..127) of a packed single-precision value (4 * 32 bits) into the low half (bits 0..63).
		// In short, we split the x and y parts:     xmm7 = C') cos(phi.x) * sin(phi.z) * sin(phi.y)				xmm7 =	C') cos(phi.x) * sin(phi.z) * sin(phi.y)
		//											 			B') sin(phi.x) * cos(phi.z)						 =>				B') sin(phi.x) * cos(phi.z)
		//														G') cos(phi.x) * cos(phi.z) * sin(phi.y)				xmm1 =	G') cos(phi.x) * cos(phi.z) * sin(phi.y)
		//														F') sin(phi.x) * sin(phi.z)										F') sin(phi.x) * sin(phi.z)
		movhlps xmm1, xmm7

			// Compute D' = -B' + C'
			movd	xmm6, [conv_signe]
			orps	xmm7, xmm6
			haddps	xmm7, xmm7

			// Compute H' = G' + F'
			haddps	xmm1, xmm1

			// Compute x.D' + y.H' + z.I'
			//
			// Compute J = x.D', K = y.H' and L = z.I'
			movups		xmm3, [coordonee]
			movsldup	xmm4, xmm7        // y.D'
			movss		xmm4, xmm1        // x.H'
			movlhps 	xmm4, xmm2        // z.I'
			mulps		xmm4, xmm3

			// Compute M' = J' + K'
			haddps	xmm4, xmm4

			// Compute N' = M' + L'
			haddps	xmm4, xmm4
			movd	[end_coord+8], xmm4
			// == == == == == == =
			// / roll
			// == == == == == == =
	}

Anonymous · ‎09-26-2014

I have a problem with two functions. What difference do you see between these two pieces of code?

	// x
	_xmm0[0] = sin(DEG2RAD(rotation_object[0]));
	_xmm0[3] = sin(DEG2RAD(rotation_object[0]));
	_xmm0[5] = sin(DEG2RAD(rotation_object[0]));

	_xmm0[1] = cos(DEG2RAD(rotation_object[0]));
	_xmm0[2] = cos(DEG2RAD(rotation_object[0]));

	// z
	_xmm1[0] = sin(DEG2RAD(rotation_object[2]));
	_xmm1[3] = sin(DEG2RAD(rotation_object[2]));
	_xmm1[5] = sin(DEG2RAD(rotation_object[2]));

	_xmm1[1] = cos(DEG2RAD(rotation_object[2]));
	_xmm1[2] = cos(DEG2RAD(rotation_object[2]));

	// y
	_xmm2[0] = sin(DEG2RAD(rotation_object[1]));
	_xmm2[3] = sin(DEG2RAD(rotation_object[1]));
	_xmm2[5] = sin(DEG2RAD(rotation_object[1]));

	_xmm2[1] = cos(DEG2RAD(rotation_object[1]));
	_xmm2[2] = cos(DEG2RAD(rotation_object[1]));

		fld		dword ptr [rotation_object + 0]	// st0 = x
			// Convert the angle from degrees to radians: pi/180 * (angle in degrees)
			fmul	dword ptr[pi_180]

			// Compute the new sin and cos of the object's angle
			fsincos
				// sin
					fst 	dword ptr[_xmm0 + 0]		// st0
					fst 	dword ptr[_xmm0 + 12]
					fstp 	dword ptr[_xmm0 + 20]
				// cos
					fst 	dword ptr [_xmm0 + 4]		// st1
					fstp 	dword ptr [_xmm0 + 8]

		fld		dword ptr[rotation_object + 8]	// st0 = z
			// Convert the angle from degrees to radians: pi/180 * (angle in degrees)
			fmul	dword ptr[pi_180]

			// Compute the new sin and cos of the object's angle
			fsincos
				// sin
					fst 	dword ptr[_xmm1 + 0]		// st0
					fst 	dword ptr[_xmm1 + 12]
					fstp 	dword ptr [_xmm1 + 20]
				// cos
					fst 	dword ptr[_xmm1 + 4]		// st1
					fstp 	dword ptr[_xmm1 + 8]

		fld		dword ptr[rotation_object + 4]	// st0 = y
			// Convert the angle from degrees to radians: pi/180 * (angle in degrees)
			fmul	dword ptr[pi_180]

			// Compute the new sin and cos of the object's angle
			fsincos
				// sin
					fst 	dword ptr[_xmm2 + 0]		// st0
					fst 	dword ptr [_xmm2 + 12]
					fstp 	dword ptr [_xmm2 + 20]
				// cos
					fst 	dword ptr [_xmm2 + 4]		// st1
					fstp 	dword ptr [_xmm2 + 8]

To me they look the same, but one rotates correctly and the other rotates with the axes inverted. I tried swapping the x, y and z stores in both, but it doesn't work.

Anonymous · ‎09-26-2014

Well, it's not my asm code's fault; apparently it's icl's fault. Let me explain: I tried with just a single asm instruction:

	__asm{
		mov		rax, 2
	    }

and it drops to 90 fps :/

Bernard · ‎09-27-2014

>>>I'm back. Mysteriously, if I use inline asm, I drop to 70 fps (250 originally)>>>

It is very hard to tell exactly what has happened, so I would advise downloading and trying the Intel VTune profiler. You can run the various versions of your rotation code under VTune and post screenshots here.

https://software.intel.com/en-us/intel-vtune-amplifier-xe/try-buy

Bernard · ‎09-27-2014

shaynox s. wrote:

Well, it's not my asm code's fault; apparently it's icl's fault. Let me explain: I tried with just a single asm instruction:
	__asm{
		mov		rax, 2
	    }
and it drops to 90 fps :/

I do not understand what you mean.

Bernard · ‎09-27-2014

static float _xmm0[6] = { 0,            // sin.x    0
                          1.0,          // cos.x    4
                          1.0,          // cos.x    8
                          0,            // sin.x    12
                          1.0,          // 1        16
                          0             // sin.y    20
                         };

When working with an auto-vectorizing compiler, try to declare statically allocated float arrays with the size of an XMM or YMM register.

static float xmm0[4] = { 0.0f,0.0f,0.0f,0.0f};

static float ymm0[8] = {0.0f,0.0f,0.0f,0.0f,0.0f,0.0f,0.0f,0.0f};

Also use the "f" float suffix and fill your array with 0.0f values, so that the compiler treats the constants as float at compile time rather than as double.
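A small sketch of that suggestion, with one addition that is not in the post: an explicit C11 `alignas` request, so the arrays start on a 16-byte or 32-byte boundary. The array names and the `is_aligned` helper are illustrative only.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/* One XMM register holds 4 floats (16 bytes); one YMM register holds 8 (32 bytes).
   The f suffix keeps every initializer a float constant. */
static alignas(16) float xmm_vals[4] = { 0.0f, 1.0f, 1.0f, 0.0f };
static alignas(32) float ymm_vals[8] = { 0.0f };

/* Returns nonzero when p sits on a multiple of `boundary` bytes. */
static int is_aligned(const void *p, uintptr_t boundary)
{
    assert(boundary != 0);
    return ((uintptr_t)p % boundary) == 0;
}
```

With the alignment guaranteed, aligned loads such as `movaps` can be used on these arrays instead of `movups`.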

Anonymous · ‎09-27-2014

I decided to vectorize manually; auto-vectorization requires too much knowledge: aligned memory, compiler options, intrinsic functions.

So I'm going back to NASM, sorry :/

As for the speed problem with inline asm: if I add just a single __asm mov rax, 2, for example, the program becomes slow.

I don't know whether that is true in general; maybe it is caused by a compiler flag. Try it yourself to compare the speed.

Bernard · ‎09-27-2014

>>>I decided to vectorize manually; auto-vectorization requires too much knowledge: aligned memory, compiler options, intrinsic functions>>>

Actually, you need the Intel compiler for auto-vectorization. For memory alignment you can use _mm_malloc(), which allocates memory aligned to the boundary you pass as its second argument (for example 16, 32 or 64 bytes).

You will not always be able to vectorize manually.
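For readers without the Intel headers, here is a hedged sketch of the same idea using C11 `aligned_alloc` as a stand-in for `_mm_malloc`; the two are not interchangeable (memory from `_mm_malloc` must be released with `_mm_free`, memory from `aligned_alloc` with `free`). The helper name `alloc_floats_64` is made up.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocate `count` floats on a 64-byte boundary.
   aligned_alloc expects the size to be a multiple of the alignment,
   so the byte count is rounded up. Release the block with free(). */
static float *alloc_floats_64(size_t count)
{
    const size_t align = 64;
    assert(count <= (SIZE_MAX - align) / sizeof(float));
    size_t bytes = count * sizeof(float);
    bytes = (bytes + align - 1) / align * align;
    if (bytes == 0)
        bytes = align;
    return (float *)aligned_alloc(align, bytes);
}
```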

Bernard · ‎09-28-2014

Please read the following PDF about auto-vectorization: https://software.intel.com/en-us/articles/a-guide-to-auto-vectorization-with-intel-c-compilers

For memory alignment, please read this: http://msdn.microsoft.com/en-us/library/ms253949(VS.80).aspx

Anonymous · ‎10-04-2014

If I understand correctly, the Intel compiler only vectorizes for loops?

Anonymous · ‎10-04-2014

I have built one example from VecSamples.zip; it's from sim2.cpp:

void vec_copy(float *dest, float *src, int len)
{
    float ii;

#pragma simd
    for (int i = 0, ii = 0.0f; i < len; i++)
        dest[i] = src[i] * ii++;
}

And here is the assembly code:

.B1.3::                         ; Preds .B1.1 .B1.3
        vxorps    xmm0, xmm0, xmm0                              ;20.22
        vcvtsi2ss xmm0, xmm0, eax                               ;20.22
        vmulss    xmm1, xmm0, DWORD PTR [r9+rdx*4]              ;20.22
        inc       eax                                           ;20.22
        vmovss    DWORD PTR [rcx+rdx*4], xmm1                   ;20.3
        inc       rdx                                           ;19.38
        cmp       rdx, r8                                       ;20.3
        jl        .B1.3         ; Prob 82%                      ;20.3

Always this -ss suffix :/

Bernard · ‎10-04-2014

>>>Always this -ss suffix :/>>>

It seems that ICC did not vectorize the code. Try using the restrict keyword at least. Did you allocate dest and src aligned on 128 bits (4 * float)? Failure to vectorize can usually be blamed on an array stride that is not exactly n * 4 * float.
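A sketch of the `restrict` suggestion in plain C99. This is a hypothetical variant of `vec_copy`, not the code from VecSamples.zip: the multiplier is derived from the loop index instead of a second counter, which removes one loop-carried variable, and both pointers are declared `restrict`.

```c
#include <assert.h>
#include <stddef.h>

/* `restrict` promises the compiler that dest and src never overlap,
   so it does not have to assume a store to dest[i] may change src. */
static void vec_scale_by_index(float *restrict dest,
                               const float *restrict src, int len)
{
    assert(dest != NULL && src != NULL);
    for (int i = 0; i < len; i++)
        dest[i] = src[i] * (float)i;
}
```

Whether the compiler then emits packed (-ps) instead of scalar (-ss) instructions still depends on the optimization flags, so the vectorization report remains the place to check.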

Bernard · ‎10-05-2014

Did you try to follow this article? https://software.intel.com/en-us/articles/a-guide-to-auto-vectorization-with-intel-c-compilers

>>>vmulss xmm1, xmm0, DWORD PTR [r9+rdx*4] >>>

Did you check with a debugger whether there is a vector load into the xmm1 register?

Anonymous · ‎10-05-2014

.

Bernard · ‎10-05-2014

>>>Always this -ss suffix :/>>>

Can you post the ICC vectorization report?

Anonymous · ‎10-05-2014

Finally, icc has vectorized only a few portions of the code:

;;; 		*(object + _x) *= size_scale;    //x

        vmovss    xmm4, DWORD PTR [r8]                          ;222.5
$LN1366:
        vinsertps xmm5, xmm4, DWORD PTR [16+r8], 16             ;222.5
$LN1367:
        vmovss    xmm4, DWORD PTR [64+r8]                       ;222.5
$LN1368:
        vinsertps xmm1, xmm5, DWORD PTR [32+r8], 32             ;222.5
$LN1369:
        vinsertps xmm5, xmm4, DWORD PTR [80+r8], 80             ;222.5
$LN1370:
        vinsertps xmm3, xmm1, DWORD PTR [48+r8], 48             ;222.5
$LN1371:
        vinsertps xmm1, xmm5, DWORD PTR [96+r8], 96             ;222.5
$LN1372:
        vinsertps xmm4, xmm1, DWORD PTR [112+r8], 112           ;222.5
$LN1373:
        vinsertf128 ymm1, ymm3, xmm4, 1                         ;222.5
$LN1374:
        vmulps    ymm3, ymm1, ymm0                              ;222.5

The report only talks about loop optimization; it does not apply to my rotation calculation. I have read the PDF, and apparently it is impossible to vectorize code outside of a loop; it only discusses loop optimization.
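A side note on the disassembly above: the eight `vmovss`/`vinsertps` loads are 16 bytes apart ([r8], [16+r8], [32+r8], ...), which is what a loop over an array of 4-float structures produces when only the x field is scaled. The C sketch below contrasts that layout with a structure-of-arrays one; the type and function names are assumptions, not taken from the engine.

```c
#include <assert.h>
#include <stddef.h>

/* Array-of-structures: x values are 16 bytes apart, so a vectorized
   loop has to collect them one element at a time. */
typedef struct { float x, y, z, color; } VertexAoS;

static void scale_x_aos(VertexAoS *v, size_t n, float k)
{
    assert(v != NULL || n == 0);
    for (size_t i = 0; i < n; ++i)
        v[i].x *= k;
}

/* Structure-of-arrays: x values are contiguous, so eight of them
   can be fetched with a single 256-bit load. */
static void scale_x_soa(float *x, size_t n, float k)
{
    assert(x != NULL || n == 0);
    for (size_t i = 0; i < n; ++i)
        x[i] *= k;
}
```

Both functions compute the same result; only the memory access pattern differs.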

Anonymous · ‎10-05-2014

I want to talk about the sin and cos functions generated by the Intel compiler. I tried to find out how it computes them, and I saw this:

call      __libm_sse2_sincos

Since I know fsincos, I wondered what this function does, and as I couldn't find the source code, I disassembled it.

And I saw this:

sub 		rsp, 68
	movaps		ss:[rsp+0x40], xmm7
	movaps		ss:[rsp+0x30], xmm6
	movsd		ss:[rsp+0x70], xmm0
	pextrw 		eax, xmm0, 3
	and 		ax, 0x7FFF
	sub 		ax, 0x3030
	cmp 		ax, 0x10C5
	ja libmmd.1800FDC1F
	;{
		unpcklpd	xmm0, xmm0
		movapd 		xmm1, ds:[0x18020EB70]
		mulpd		xmm1, xmm0
		movapd		xmm2, ds:[0x18020EB60]
		cvtsd2si	edx, xmm1
		addpd		xmm1, xmm2
		movapd		xmm3, ds:[0x18020EB50]
		subpd		xmm1, xmm2
		movapd		xmm2, ds:[0x18020EB40]
		mulpd		xmm3, xmm1
		add			rdx, 0x1C7600
		movapd		xmm4, xmm0
		and			rdx, 0x3F
		movapd		xmm5, ds:[0x18020EB30]
		lea			rax, qword  ds:[0x18020D9E0]
		shl			rdx, 6
		add 		rax, rdx
		mulpd		xmm2, xmm1
		subpd 		xmm0, xmm3
		mulpd 		xmm1, ds:[0x18020EB20]
		subpd 		xmm4, xmm3
		movapd 		xmm7, ds:[rax+0x10]
		movapd	 	xmm3, xmm4
		subpd 		xmm4, xmm2
		mulpd		xmm5, xmm0
		subpd 		xmm0, xmm2
		movapd 		xmm6, ds:[0x18020EB10]
		mulpd 		xmm7, xmm4
		subpd		xmm3, xmm4
		mulpd 		xmm5, xmm0
		mulpd 		xmm0, xmm0
		subpd 		xmm3, xmm2
		movapd 		xmm2, ds:[rax]
		subpd 		xmm1, xmm3
		movapd 		xmm3, ds:[rax+0x30]
		addpd 		xmm2, xmm3
		subpd 		xmm7, xmm2
		mulpd 		xmm1, xmm7
		movapd 		xmm7, ds:[rax+0x10]
		mulpd 		xmm2, xmm4
		mulpd 		xmm6, xmm0
		mulpd 		xmm3, xmm4
		mulpd 		xmm2, xmm0
		mulpd 		xmm7, xmm0
		mulpd 		xmm0, xmm0
		addpd 		xmm5, ds:[0x18020EB00]
		mulpd 		xmm4, ds:[rax]
		addpd		xmm6, ds:[0x18020EAF0]
		mulpd		xmm5, xmm0
		movapd 		xmm0, xmm3
		addpd 		xmm3, ds:[rax+0x10]
		addpd		xmm6, xmm5
		movq 		xmm5, xmm6
		unpckhpd	xmm6, xmm6
		unpcklpd	xmm5, xmm5
		mulpd		xmm6, xmm7
		mulpd 		xmm2, xmm5
		movapd 		xmm7, xmm4
		addpd		xmm4, xmm3
		movapd		xmm5, ds:[rax+0x10]
		subpd 		xmm5, xmm3
		subpd		xmm3, xmm4
		addpd 		xmm1, ds:[rax+0x20]
		addpd 		xmm5, xmm0
		addpd 		xmm3, xmm7
		addpd 		xmm1, xmm5
		addpd 		xmm1, xmm3
		addpd 		xmm1, xmm2
		addpd 		xmm1, xmm6
		addpd 		xmm1, xmm4
		movq 		xmm0, xmm1
		unpckhpd 	xmm1, xmm1
	;} jmp libmmd.1800FDE1E
	; ... Prepares the data to be processed, in my opinion.
	libmmd.1800FDC1F:
	; ... Contains as many SIMD instructions as the first block.
	ret

Later I saw another function that looks somewhat similar at this URL:

https://github.com/mario007/renmas/blob/master/renmas3/asm/sincosps.py

I ported this code into my project (SDL asm), but my fps drops by about 20 compared with fsincos. Is that normal?

	;=============================================================================================================
	 ; float sin[4], cos[4] sincosps (float angle_radians[4])
	 ; Computes the sin and cos of the 4 angles contained in angle_radians[4].
	 ; Input: angle_radians[4]
	 ; Output: sin[4] and cos[4]
	 ; Destroyed: ebx - edx - ebp
	 ; DATA:
			_ps_am_inv_sign_mask	dd	0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF
			_ps_am_sign_mask		dd	0x80000000, 0x80000000, 0x80000000, 0x80000000
			_ps_am_pi_o_2			dd	1.57079632679, 1.57079632679, 1.57079632679, 1.57079632679
			_ps_am_2_o_pi			dd	0.63661977236, 0.63661977236, 0.63661977236, 0.63661977236
			_epi32_1				dd	1, 1, 1, 1			; packed 32-bit integers (used with pand/paddd), not floats
			_ps_am_1 				dd	1.0, 1.0, 1.0, 1.0
			_epi32_2 				dd	2, 2, 2, 2			; packed 32-bit integers (used with pand), not floats
			_ps_sincos_p3			dd	-0.00468175413, -0.00468175413, -0.00468175413, -0.00468175413
			_ps_sincos_p2 			dd	0.0796926262, 0.0796926262, 0.0796926262, 0.0796926262
			_ps_sincos_p1 			dd	-0.64596409750621,-0.64596409750621,-0.64596409750621,-0.64596409750621
			_ps_sincos_p0 			dd	1.570796326794896, 1.570796326794896, 1.570796326794896, 1.570796326794896
			
	;=============================================================================================================
	sincosps:
			movups 		xmm7, [sincosps_angle_radians]
			movups		xmm1, [_ps_am_inv_sign_mask]
		andps 		xmm0, xmm1
		movups		xmm1, [_ps_am_sign_mask]
		andps 		xmm7, xmm1
		movups		xmm1, [_ps_am_2_o_pi]
		mulps 		xmm0, xmm1
		pxor 		xmm3, xmm3
		movups 		xmm5, [_epi32_1]
		movups 		xmm4, [_ps_am_1]
		cvttps2dq 	xmm2, xmm0
		pand 		xmm5, xmm2
		pcmpeqd		xmm5, xmm3
		movups 		xmm3, [_epi32_1]
		movups 		xmm1, [_epi32_2]
		cvtdq2ps 	xmm6, xmm2
		paddd 		xmm3, xmm2
		pand		xmm2, xmm1
		pand 		xmm3, xmm1
		subps		xmm0, xmm6
		pslld 		xmm2, 30
		minps 		xmm0, xmm4
		subps 		xmm4, xmm0
		pslld 		xmm3, 30
		movaps 		xmm6, xmm4
		xorps 		xmm2, xmm7
		movaps		xmm7, xmm5
		andps 		xmm6, xmm7
		andnps 		xmm7, xmm0
		andps 		xmm0, xmm5
		andnps		xmm5, xmm4
		movups 		xmm4, [_ps_sincos_p3]
		orps 		xmm6, xmm7
		orps 		xmm0, xmm5
		movups 		xmm5, [_ps_sincos_p2]
		movaps 		xmm1, xmm0
		movaps 		xmm7, xmm6
		mulps 		xmm0, xmm0
		mulps 		xmm6, xmm6
		orps 		xmm1, xmm2
		orps 		xmm7, xmm3
		movaps 		xmm2, xmm0
		movaps 		xmm3, xmm6
		mulps 		xmm0, xmm4
		mulps 		xmm6, xmm4
		movups 		xmm4, [_ps_sincos_p1]
		addps 		xmm0, xmm5
		addps 		xmm6, xmm5
		movups 		xmm5, [_ps_sincos_p0]
		mulps 		xmm0, xmm2
		mulps 		xmm6, xmm3
		addps 		xmm0, xmm4
		addps 		xmm6, xmm4
		mulps 		xmm0, xmm2
		mulps 		xmm6, xmm3
		addps 		xmm0, xmm5
		addps 		xmm6, xmm5
		mulps 		xmm0, xmm1
		mulps		xmm6, xmm7
		movups		[sincosps_sin], xmm0 	; sine of the input angles
		movups		[sincosps_cos], xmm6 	; cosine of the input angles
		
		; to add in put_object
			movups		xmm0, [deg_rotation_x]
			movups		xmm1, [pi_180]
			mulps		xmm0, xmm1
			movups		[sincosps_angle_radians], xmm0
		call	sincosps
		
		movups		xmm1, [sincosps_cos]
		movsldup	xmm0, xmm1
		movsd 		[_xmm0 + 4], xmm0			; save cos(x)
		movhps 		[_xmm2 + 4], xmm0			; save cos(y)
		movups		xmm1, [sincosps_cos + 8]
		movsldup	xmm0, xmm1
		movsd		[_xmm1 + 4], xmm0			; save cos(z)
		
		; save sin(x)
			movups		xmm0, [sincosps_sin]
			movss		[_xmm0 + 0], xmm0
			movss		[_xmm0 + 12], xmm0
			movss		[_xmm0 + 20], xmm0
		
		; save sin(y)
			movups		xmm0, [sincosps_sin + 4]
			movss		[_xmm2 + 0], xmm0
			movss		[_xmm2 + 12], xmm0
			movss		[_xmm2 + 20], xmm0

		; save sin(z)
			movups		xmm0, [sincosps_sin + 8]
			movss		[_xmm1 + 0], xmm0
			movss		[_xmm1 + 12], xmm0
			movss		[_xmm1 + 20], xmm0
	ret
	;=============================================================================================================
	; / sincosps
	;=============================================================================================================

I saw in the optimization reference manual that fsincos has a latency of 119, while the latency of SIMD instructions is around 5-6.

Is latency the same thing as clock cycles?

Bernard · ‎10-06-2014

>>>Finally, icc has vectorized only a few portions of the code:>>>

Yes, that portion of the code was vectorized. You can also see that the loop was unrolled 6x. I suppose the auto-vectorizer tries to vectorize the major code hotspots, which are mainly loops. I have not tried to vectorize code that does not run inside a loop. Regarding that piece of vectorized loop disassembly, you can see that the compiler performed stride checks in order to calculate its length.

If you have multiple arrays loaded with doubles, you can unroll the loop 8x and issue a prefetch for every array per cycle.

Pseudocode:

for (unsigned int i = 0; i < Len; i += 8)
{
    prefetch(array1 + x)    // prefetch distance x should be found by trial and error
    array1[i]               // do some operation
    array2[i]               // do some operation
    array3[i]               // do some operation
    array4[i]               // do some operation
    array5[i]               // do some operation

    prefetch(array2 + x)    // prefetch distance x should be found by trial and error
    array1[i+1]             // do some operation
    array2[i+1]             // do some operation
    array3[i+1]             // do some operation
    array4[i+1]             // do some operation
    array5[i+1]             // do some operation

    ............................... code continues

    array1[i+7]             // do some operation
    array2[i+7]             // do some operation
    array3[i+7]             // do some operation
    array4[i+7]             // do some operation
    array5[i+7]             // do some operation
}
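The pseudocode above could be turned into compilable C along these lines. This is only a sketch under stated assumptions: it uses two input arrays instead of five, relies on the GCC/Clang builtin `__builtin_prefetch` (a no-op macro is substituted elsewhere), and `PREFETCH_DISTANCE` is an arbitrary starting value that has to be tuned by measurement, as the post says.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(p) __builtin_prefetch((p))
#else
#define PREFETCH(p) ((void)(p))   /* no-op where the builtin is unavailable */
#endif

/* Made-up starting value, in elements; tune by trial and error. */
enum { PREFETCH_DISTANCE = 64 };

/* out[i] = a[i] * b[i], handling 8 doubles (one 64-byte line) per iteration. */
static void multiply_arrays(double *out, const double *a, const double *b,
                            size_t len)
{
    size_t i = 0;
    assert(out != NULL && a != NULL && b != NULL);
    for (; i + 8 <= len; i += 8) {
        if (i + PREFETCH_DISTANCE < len) {   /* stay inside the arrays */
            PREFETCH(a + i + PREFETCH_DISTANCE);
            PREFETCH(b + i + PREFETCH_DISTANCE);
        }
        for (size_t j = 0; j < 8; ++j)       /* fixed trip count: easy to unroll */
            out[i + j] = a[i + j] * b[i + j];
    }
    for (; i < len; ++i)                     /* remainder elements */
        out[i] = a[i] * b[i];
}
```

On many CPUs the hardware prefetcher already handles simple sequential access, so software prefetch may bring no gain here; profiling is the only way to know.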