Solved: 3D engine - Page 7

Anonymous · ‎09-02-2014

Hello there,

OK, here we go. I have a dream: to build a 3D engine written 100% in Intel assembly, running on the CPU only. For now I only use rotation matrices.

It works, of course, but it is slow when I draw a lot of pixels.

Recently I decided to add voxels to my engine, and it becomes slow at 8000 voxels or more (a 20 * 20 * 20 cube). When I saw that NVIDIA can display 32M voxels (fire), I wondered how they do it!

I have a rough idea of the reason: the MMU, paging, segmentation, memory.

Am I right?

Another question: is the x87 FPU slower than SSE at floating-point computation, or does it depend on the data being manipulated?

PS: I work without an OS such as Windows or Linux; I run on my own kernel and bootloader, also written in assembly with NASM.

Sorry if my English is not good; I'm French and I use Google Translate ^-^

Bradley_W_Intel · ‎09-02-2014

You clearly are using the processor in a very advanced way. I will do my best to answer your questions:

1) Why is your voxel engine not able to efficiently render as many voxels as you'd like? Voxel engines need to maximize their use of parallelism (both threading and SIMD) and also to store the data efficiently in an octree or some other structure that can handle sparse data. If you are doing all these things and still not getting the performance you expect, it's an optimization problem. Some Intel tools like VTune Performance Analyzer are excellent for performance analysis.

2) Is single data floating point math faster than SIMD (if I understood you)? Typically SIMD will be faster than single data instructions if your data is laid out in a way that supports the SIMD calls. In all cases, the only way for you to know for certain which way is faster is to test it.

3) How can you select between discrete and processor graphics? DirectX has methods of enumerating adapters. In such a case, the processor graphics is listed separately from the discrete graphics. If you are choosing your adapter based on the amount of available memory, you may be favoring the processor graphics when you didn't intend to. Intel has sample code that shows how to properly detect adapters in DirectX at https://software.intel.com/en-us/vcsource/samples/gpu-detect. The process for OpenGL is not well documented.

4) Can I use one processor to control execution of a second processor? Probably not. The details on Intel processors are covered at http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html. It's possible, though unlikely, that you'll be able to find something in there that can help you.
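To illustrate point 2, here is a minimal sketch of what "data laid out in a way that supports the SIMD calls" can look like. It assumes the coordinates are stored as separate contiguous float arrays whose length is a multiple of 4; the function names are hypothetical and not taken from the engine discussed in this thread. Whether the SSE version is actually faster still has to be measured.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Scalar reference: out[i] = a[i] * s + b[i], one float per iteration. */
void scale_add_scalar(const float *a, const float *b, float s,
                      float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] * s + b[i];
}

/* SSE version: the same arithmetic on four floats per iteration.
   Assumption for this sketch: n is a multiple of 4. */
void scale_add_sse(const float *a, const float *b, float s,
                   float *out, size_t n)
{
    assert(n % 4 == 0);
    const __m128 vs = _mm_set1_ps(s);          /* broadcast s to all 4 lanes */
    for (size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);       /* unaligned loads are allowed */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(_mm_mul_ps(va, vs), vb));
    }
}
```

Both functions should produce the same results; timing them on real data is the only way to see which one wins for a given layout.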


Bernard · 10-06-2014

>>>I have transferred this code into my project (SDL asm), but my fps dropped by 20 compared with fsincos. Is that normal?>>>

Can you test the execution speed of your code with the call to the __libm_sse2_sincos function?

Usually __libm_sse2_sincos should be faster than x87 fsincos.

Bernard · 10-06-2014

>>>I saw in the optimization reference manual that fsincos has a latency of 119, while the latency of SIMD instructions is around 5-6.

Is latency the same as clock cycles?>>>

I think that the raw execution time (without call and ret overhead) of __libm_sse2_sincos could be less than that of x87 fsincos. A crude estimate of the __libm_sse2_sincos execution time can be obtained simply by summing the latencies of all its machine code instructions. Keep in mind that some of the instructions can be executed in parallel or out of order.

Latency is the time in CPU cycles needed to process (retire) a machine code instruction. For example, the latency of x87 fsincos is 112 CPU cycles, which means that 112 cycles are needed to execute that instruction. At the hardware level fsincos is broken down into micro-ops which are scheduled to run on the FP adder and multiplier units and accumulate their results in floating-point physical registers. A Horner scheme is probably used to converge on the result.
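For readers unfamiliar with the Horner scheme mentioned above, here is a minimal sketch. It evaluates a polynomial with one multiply and one add per coefficient. The Taylor coefficients for sin(x) are a hypothetical choice for illustration only; this is not a claim about what fsincos or __libm_sse2_sincos do internally, and the approximation is reasonable only for small |x|.

```c
#include <assert.h>
#include <stddef.h>

/* Evaluate c[0] + c[1]*x + ... + c[n-1]*x^(n-1) with Horner's scheme:
   one multiply and one add per coefficient, highest degree first. */
double horner(const double *c, size_t n, double x)
{
    assert(c != NULL);
    double acc = 0.0;
    for (size_t i = n; i-- > 0; )
        acc = acc * x + c[i];
    return acc;
}

/* sin(x) ~ x - x^3/6 + x^5/120 - x^7/5040, written as x * P(x^2).
   Taylor coefficients are an assumption made for this sketch. */
double sin_taylor7(double x)
{
    static const double c[4] = { 1.0, -1.0 / 6.0, 1.0 / 120.0, -1.0 / 5040.0 };
    return x * horner(c, 4, x * x);
}
```

Each Horner step depends on the previous one, so the chain of multiply and add latencies gives a first crude estimate of the cost, in the spirit of the latency summation described above.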

Bernard · ‎10-06-2014

Today I will upload my own implementation of the trigonometric functions.

Bernard · ‎10-06-2014

Pseudocode:

for (unsigned int i = 0; i < Len; i += 8)
{
        prefetch(array1 + i + x);  // prefetch distance x should be found by trial and error

        array1[i]    // do some operation on array
        array2[i]    // do some operation on array
        array3[i]    // do some operation on array
        array4[i]    // do some operation on array
        array5[i]    // do some operation on array

        prefetch(array2 + i + x);  // prefetch distance x should be found by trial and error

        array1[i+1]  // do some operation on array
        array2[i+1]  // do some operation on array
        array3[i+1]  // do some operation on array
        array4[i+1]  // do some operation on array
        array5[i+1]  // do some operation on array

        // ... code continues in the same way for i+2 through i+6 ...

        array1[i+7]  // do some operation on array
        array2[i+7]  // do some operation on array
        array3[i+7]  // do some operation on array
        array4[i+7]  // do some operation on array
        array5[i+7]  // do some operation on array
}
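The pseudocode above could be turned into compilable C roughly as follows. This is only a sketch under stated assumptions: two input arrays instead of five, double elements so that a block of 8 covers 64 bytes, a length that is a multiple of 8, and a hypothetical starting value for the prefetch distance (PREFETCH_DIST), which still has to be tuned by measurement.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0 */

/* Prefetch distance in elements; a hypothetical starting value. */
#define PREFETCH_DIST 64

/* out[i] = a[i] + b[i], processed in blocks of 8 doubles (64 bytes),
   with one software prefetch per input array per block.
   Assumption for this sketch: len is a multiple of 8. */
void add_arrays_prefetch(const double *a, const double *b,
                         double *out, size_t len)
{
    assert(len % 8 == 0);
    for (size_t i = 0; i < len; i += 8) {
        /* Only prefetch addresses that are still inside the arrays. */
        if (i + PREFETCH_DIST < len) {
            _mm_prefetch((const char *)(a + i + PREFETCH_DIST), _MM_HINT_T0);
            _mm_prefetch((const char *)(b + i + PREFETCH_DIST), _MM_HINT_T0);
        }
        /* The block of 8 operations; a compiler will usually unroll this. */
        for (size_t k = 0; k < 8; ++k)
            out[i + k] = a[i + k] + b[i + k];
    }
}
```

The prefetch is only a hint, so the result is the same with or without it; only the timing can change.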

Bernard · ‎10-06-2014

Post #130 is related to post #126.

As you can see in post #130, with 8x unrolling each iteration touches a 64-byte stride of every array, which fits in one L1D cache line.

Anonymous · ‎10-06-2014

To compare the speed of fsincos with a call to __libm_sse2_sincos, here is the test code.

The code generated for the cos and sin block contains 3 calls to __libm_sse2_sincos, which is why I use only 3 fsincos instructions.

I tried calling __libm_sse2_sincos from inline asm, but the compiler could not find the function and reported an error :/

float			rotation_object[4] = { 0 };     // 0: x , 1: y , 2: z

void        put_pixel(float *coord, int color, float* a)
{
	int		offset_pixel;

	int		a_time_sincosps, b_time_sincosps;
	int		a_time_fsincos, b_time_fsincos;
	unsigned int	loop_test;
	int		temp;

	__asm	rdtsc
	__asm	mov		[a_time_sincosps], eax

			trigo.cos[_x] = cos(DEG2RAD(rotation_object[_x]));
			trigo.cos[_y] = cos(DEG2RAD(rotation_object[_y]));
			trigo.cos[_z] = cos(DEG2RAD(rotation_object[_z]));

			trigo.sin[_x] = sin(DEG2RAD(rotation_object[_x]));
			trigo.sin[_y] = sin(DEG2RAD(rotation_object[_y]));
			trigo.sin[_z] = sin(DEG2RAD(rotation_object[_z]));

	__asm	rdtsc
	__asm	mov		[b_time_sincosps], eax
	
	printf("time_sincosps = %i\n", b_time_sincosps - a_time_sincosps);

	__asm	rdtsc
	__asm	mov		[a_time_fsincos], eax

			__asm		vxorpd    xmm1, xmm1, xmm1				
			__asm		vcvtss2sd xmm1, xmm1, [temp]	
			__asm		vmovsd    xmm15, [temp]
			__asm		vmulsd    xmm0, xmm15, xmm1				
			__asm		vzeroupper								
		__asm		fsincos

			__asm		vxorpd    xmm2, xmm2, xmm2				
			__asm		vmovapd   xmm14, xmm0					
			__asm		vcvtss2sd xmm2, xmm2, [temp]
			__asm		vcvtsd2ss xmm1, xmm1, xmm1			
			__asm		vmulsd    xmm0, xmm15, xmm2				
			__asm		vmovss    [temp], xmm1
		__asm		fsincos

			__asm		vxorpd    xmm2, xmm2, xmm2		
			__asm		vmovapd   xmm13, xmm0			
			__asm		vcvtss2sd xmm2, xmm2, [temp]
			__asm		vcvtsd2ss xmm1, xmm1, xmm1
			__asm		vmulsd    xmm0, xmm15, xmm2
			__asm		vmovss    [temp], xmm1
		__asm		fsincos

		__asm		vcvtsd2ss xmm1, xmm1, xmm1
		__asm		vcvtsd2ss xmm14, xmm14, xmm14
		__asm		vcvtsd2ss xmm13, xmm13, xmm13
		__asm		vcvtsd2ss xmm0, xmm0, xmm0
		__asm		vmovss    [temp], xmm1
		__asm		vmovss    [temp], xmm14
		__asm		vmovss    [temp], xmm13
		__asm		vmovss    [temp], xmm0
	
	__asm	rdtsc
	__asm	mov		[b_time_fsincos], eax

	printf("time_fsincos = %i", b_time_fsincos - a_time_fsincos);

	while (1);
}

I added the store and convert instructions because that is the code generated for trigo.cos[_x] = ... and the other stores.

I can't write only cos(DEG2RAD(rotation_object[_x])); because then the compiler removes the call to __libm_sse2_sincos.

And here is the result:

time_sincosps = 19839
time_fsincos = 1981

Also, is it possible to alter the compilation process by rewriting the generated asm file? It would be wonderful if so: I could correct the code or rewrite it the way I want.

Finally, is it possible to customize the asm code produced by icl? For example:

Keep numbers displayed as written, e.g. in hexadecimal, and write them with a 0x prefix instead of an H suffix:

return 0xdeadbeef --> (origin) mov eax, -559038737 --> (wish) mov eax, 0xdeadbeef ^^

Anonymous · ‎10-06-2014

I will try the intrinsic functions using this wonderful web page: https://software.intel.com/sites/landingpage/IntrinsicsGuide/

Anonymous · ‎10-06-2014

Working with intrinsic functions (load, set) is too complex, simply because the compiler does not allow casting a float[4] to __m128 :/

"error : argument of type "float" is incompatible with parameter of type "__m128"

"error : a value of type "__m128" cannot be assigned to an entity of type "float""

"error : cast to type "__m128" is not allowed"

It's strange that the compiler does not allow this; one more restriction added to the blacklist of programming styles forbidden by high-level compilers :/

Bernard · ‎10-06-2014

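Do not cast a float array to __m128: __m128 is a dedicated 16-byte vector type (a union in the Microsoft headers), not an array, so the compiler rejects the conversion. You can initialize the __m128 member fields directly, although this is not recommended, or you can cast a float pointer to a __m128 pointer. The pointer cast is only safe when the float data is 16-byte aligned.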

float vertex[4] = {0.f,0.f,0.f,0.f};

__m128 *ptr_m128 = (__m128 *)vertex;

http://stackoverflow.com/questions/11759791/is-it-possible-to-cast-floats-directly-to-m128-if-they-are-16-byte-alligned

http://stackoverflow.com/questions/5118158/using-sse-to-speed-up-computation-store-load-and-alignment?rq=1
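As an alternative to the pointer cast, the load and store intrinsics convert between float[4] and __m128 explicitly and, in their unaligned forms, work for any alignment. The sketch below uses only standard SSE intrinsics; the helper names scale4 and first_lane are hypothetical.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Multiply the four floats of v by s, in place. */
void scale4(float v[4], float s)
{
    assert(v != 0);
    __m128 x = _mm_loadu_ps(v);            /* float[4] -> __m128, any alignment */
    x = _mm_mul_ps(x, _mm_set1_ps(s));     /* a float becomes __m128 via broadcast */
    _mm_storeu_ps(v, x);                   /* __m128 -> float[4] */
}

/* Return lane 0 of a vector as a plain float. */
float first_lane(__m128 x)
{
    return _mm_cvtss_f32(x);
}
```

These three conversions (load, broadcast, store or extract) cover the situations behind the three compiler errors quoted earlier in the thread.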

Anonymous · ‎10-06-2014

OK, but since I do a lot of loads and stores, it will be a little complex, and I don't think it will be optimized if I do it like this:

	static float	_xmm0[4], _xmm1[4], _xmm2[4], _xmm3[4], _xmm4[4], _xmm7[4];

	trigo.coord = coord;
	
	trigo._xmm0[0] = sin(DEG2RAD(rotation_object[_x]));
	trigo._xmm0[1] = cos(DEG2RAD(rotation_object[_x]));
	trigo._xmm0[2] = cos(DEG2RAD(rotation_object[_x]));
	trigo._xmm0[3] = sin(DEG2RAD(rotation_object[_x]));
	trigo._xmm0[4] = 1;
	trigo._xmm0[5] = sin(DEG2RAD(rotation_object[_x]));

	trigo._xmm1[0] = sin(DEG2RAD(rotation_object[_z]));
	trigo._xmm1[1] = cos(DEG2RAD(rotation_object[_z]));
	trigo._xmm1[2] = cos(DEG2RAD(rotation_object[_z]));
	trigo._xmm1[3] = sin(DEG2RAD(rotation_object[_z]));
	trigo._xmm1[4] = 1;
	trigo._xmm1[5] = sin(DEG2RAD(rotation_object[_z]));

	trigo._xmm2[0] = sin(DEG2RAD(rotation_object[_y]));
	trigo._xmm2[1] = cos(DEG2RAD(rotation_object[_y]));
	trigo._xmm2[2] = cos(DEG2RAD(rotation_object[_y]));
	trigo._xmm2[3] = sin(DEG2RAD(rotation_object[_y]));
	trigo._xmm2[4] = 1;
	trigo._xmm2[5] = sin(DEG2RAD(rotation_object[_y]));

		// == == == == == == =
		// yaw
		// == == == == == == =
	//Yaw:; y
		// Apply the rotation to the point | [esi + 0] = x
		// | [esi + 4] = y
		// | [esi + 8] = z
		// Compute x = x.cos(phi_y) * cos(phi_z) - y.cos(phi_y) * sin(phi_z) - z.sin(phi_y)
		//
		// Compute A = x.cos(phi_y), B = y.cos(phi_y) and C = z.sin(phi_y)
			*_xmm0 = _mm_mul_ps(trigo._xmm2[1], trigo.coord);

		// Compute D = A * cos(phi_z), E = B * sin(phi_z) and C = C * 1
			* _xmm0 = _mm_mul_ps(*_xmm0, trigo._xmm1[2]);

		// Compute F = D - E, C = C - 0
			*_xmm0 = _mm_hsub_ps(*_xmm0, *_xmm0);

		// Compute xmm0 = F - C
			*_xmm0 = _mm_hsub_ps(*_xmm0, *_xmm0);

		// Save the new coordinate
			trigo.end_coord[_x] = *_xmm0;

		// == == == == == == =
		// / yaw
		// == == == == == == =

		// == == == == == == =
		// pitch
		// == == == == == == =
	//Pitch:; x
		  // Apply the rotation to the point | [esi + 0] = x
		  // | [esi + 4] = y
		  // | [esi + 8] = z
		  // Compute y = x.(cos(phi_x) * sin(phi_z) - sin(phi_x) * cos(phi_z) * sin(phi_y)) +
		  //				 y.(sin(phi_x) * sin(phi_z) * sin(phi_y) + cos(phi_x) * cos(phi_z)) -
		  //				 z.(sin(phi_x) * cos(phi_y))
		  //
		// Compute A = cos(phi_x) * sin(phi_z), B = sin(phi_x) * cos(phi_z), E = cos(phi_x) * cos(phi_z) and F = sin(phi_x) * sin(phi_z)
		
		//movddup		xmm0, [_xmm0 + 8]
			_xmm0[2] = trigo._xmm0[2];
			_xmm0[3] = trigo._xmm0[2];

		//movddup		xmm7, [_xmm2 + 12]
			*_xmm0 = _mm_mul_ps(*_xmm0, trigo._xmm1);

		// Save xmm0 in xmm7 so that it can be copied into xmm0 for Roll, because the equation for y resembles the equation for z,
		// except that sin(phi_y) multiplies different terms

		// Compute C' = A' * sin(phi_y) and G' = E' * sin(phi_y)
		
		// movddup		xmm2, [_xmm2 + 16]
			_xmm7[2] = trigo._xmm2[3];
			_xmm7[3] = trigo._xmm2[3];

			*_xmm7 = _mm_mul_ps(*_xmm7, *_xmm0);

		// Compute C = B * sin(phi_y) and G = F * sin(phi_y)
		
		// movhlps		xmm1, xmm0
			_xmm2[2] = trigo._xmm2[4];
			_xmm2[3] = trigo._xmm2[4];

			*_xmm0 = _mm_mul_ps(*_xmm0, *_xmm2);

		// Copy the upper half (bits 64..127) of a packed single-precision value (4 * 32 bits) into its lower half (bits 0..63).
		// In short, we split the two parts x and y: xmm0 = A) cos(phi_x) * sin(phi_z)								xmm0 = cos(phi_x) * sin(phi_z)
		//											 			C) sin(phi_x) * cos(phi_z) * sin(phi_y) = > sin(phi_x) * sin(phi_y) * cos(phi_z)
		//														E) cos(phi_x) * cos(phi_z)								xmm1 = cos(phi_x) * cos(phi_z)
		//														G) sin(phi_x) * sin(phi_z) * sin(phi_y)							sin(phi_x) * sin(phi_y) * sin(phi_z)
		
		// movhlps 	xmm1, xmm0
			_xmm1[1] = _xmm0[3];
			_xmm1[0] = _xmm0[2];
		// Compute D = A - C
			*_xmm0 = _mm_hsub_ps(*_xmm0, *_xmm0);

		// Compute H = E + G
			*_xmm1 = _mm_hadd_ps(_xmm1, _xmm1);

		// Compute sin(phi_x) * cos(phi_y) and cos(phi_x) * cos(phi_y)
		//
		// Compute I.roll = cos(phi_x) * cos(phi_y) and I.Pitch = sin(phi_x) * cos(phi_y)
		// movlps		xmm3, [_xmm0 + 8]
			_xmm3[1] = trigo._xmm0[5];
			_xmm3[0] = trigo._xmm0[4];

		// movlps		xmm2, [_xmm2 + 4]
			_xmm2[1] = trigo._xmm2[2];
			_xmm2[0] = trigo._xmm2[1];

			*_xmm2 = _mm_mul_ps(*_xmm2, *_xmm3);

		// movshdup 	xmm3, xmm2
			_xmm3[0] = _xmm2[1];
			_xmm3[1] = _xmm2[1];
			_xmm3[2] = _xmm2[3];
			_xmm3[3] = _xmm2[3];
		// Compute x.D + y.H - z.I
		//
		// Compute J = x.D, K = y.H and L = z.I
		// movsldup		xmm4, xmm1	; y.H
			_xmm4[0] = _xmm1[0];
			_xmm4[1] = _xmm1[0];
			_xmm4[2] = _xmm1[2];
			_xmm4[3] = _xmm1[2];

		// movss		xmm4, xmm0	; x.D
			_xmm4[0] = _xmm0[0];

		// movlhps		xmm4, xmm3; z.I.Pitch
			_xmm4[2] = _xmm3[0];
			_xmm4[3] = _xmm3[1];

			*_xmm4 = _mm_mul_ps(*_xmm4, trigo.coord);

		// Compute M = J + K
			_xmm4 = _mm_hadd_ps(*_xmm4, *_xmm4)

		// Compute N = M - L
			_xmm4 = _mm_hsub_ps(*_xmm4, *_xmm4)

		// Save the new coordinate
			trigo.end_coord[_y] = *_xmm4;

		// == == == == == == =
		// / pitch
		// == == == == == == =
		// == == == == == == =
		// roll
		// == == == == == == =
	//Roll:; z
		// Apply the rotation to the point | [esi + 0] = x
		// | [esi + 4] = y
		// | [esi + 8] = z
		// Compute z' = x.(cos(phi_x) * cos(phi_z) * sin(phi_y) + sin(phi_x) * sin(phi_z)) + 
		//				  y.(sin(phi_x) * cos(phi_z) - cos(phi_x) * sin(phi_z) * sin(phi_y)) +
		//				  z.(cos(phi_x) * cos(phi_y))
		//
		// Copy the upper half (bits 64..127) of a packed single-precision value (4 * 32 bits) into its lower half (bits 0..63).
			// In short, we split the two parts x and y: xmm7 = C') cos(phi_x) * sin(phi_z) * sin(phi_y)				xmm7 =	C') cos(phi_x) * sin(phi_z) * sin(phi_y))
			//											 			B') sin(phi_x) * cos(phi_z)						 =>				B') sin(phi_x) * cos(phi_z)
			//														G') cos(phi_x) * cos(phi_z) * sin(phi_y)				xmm1 =	G') cos(phi_x) * cos(phi_z) * sin(phi_y)
			//														F') sin(phi_x) * sin(phi_z)										F') sin(phi_x) * sin(phi_z)
		// movhlps		xmm1, xmm7	
			_xmm1[0] = _xmm7[2];
			_xmm1[1] = _xmm7[3];

		// Compute D' = -B' + C'
			or	xmm7[0], 0x80000000;
			*_xmm7 = _mm_hadd_ps(_xmm7, _xmm7);

		// Compute H' = G' + F'
			*_xmm1 = _mm_hadd_ps(_xmm1, _xmm1);

		// Compute x.D' + y.H' + z.I'
		//
		// Compute J = x.D', K = y.H' and L = z.I'
		
		// movsldup		xmm4, xmm7	; y.D'	
			_xmm4[0] = _xmm7[0];
			_xmm4[1] = _xmm7[0];
			_xmm4[2] = _xmm7[2];
			_xmm4[3] = _xmm7[2];

		// movss		xmm4, xmm1	; x.H'
			_xmm4[0] = _xmm1[0];

		// movlhps 	xmm4, xmm2	; z.I'
			_xmm4[2] = _xmm2[0];
			_xmm4[3] = _xmm2[1];

			*_xmm4 = _mm_mul_ps(*_xmm4, trigo.coord);

		// Compute M' = J' + K'
			_xmm4 = _mm_hadd_ps(*_xmm4, *_xmm4)

		// Compute N' = M' + L'
			_xmm4 = _mm_hadd_ps(*_xmm4, *_xmm4)

		// Save the new coordinate
			trigo.end_coord[_z] = *_xmm4;

		// == == == == == == =
		// / roll
		// == == == == == == =

Bernard · ‎10-07-2014

The piece of code that calls the library cos and sin functions is expected to run slower than the code that uses the CPU's built-in x87 fsincos instruction. I would run both pieces of code inside a double loop and average the runs. When using rdtsc you must serialize execution inside the CPU: because of out-of-order processing, part of the timed code may be scheduled to run before the rdtsc instruction executes.
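One common way to apply this advice is sketched below: fence around the counter reads and average over many runs after a warm-up call. It assumes GCC or Clang with <x86intrin.h> (MSVC and ICL on Windows provide the same intrinsics through <intrin.h>) and a CPU that supports rdtscp. The names tsc_begin, tsc_end, average_ticks and sample_work are hypothetical. Note also that rdtsc returns a 64-bit count in EDX:EAX, so keeping only EAX, as the test code in this thread does, discards the upper half.

```c
#include <assert.h>
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc, __rdtscp, _mm_lfence (GCC/Clang) */

/* Start of a timed region: the fences keep earlier and later
   instructions from being reordered around the counter read. */
static inline uint64_t tsc_begin(void)
{
    _mm_lfence();
    uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
}

/* End of a timed region: rdtscp waits for earlier instructions to
   finish; the fence keeps later ones from starting before the read. */
static inline uint64_t tsc_end(void)
{
    unsigned int aux;
    uint64_t t = __rdtscp(&aux);
    _mm_lfence();
    return t;
}

/* Average TSC ticks per call of fn over `runs` calls, after one
   untimed warm-up call (caches, lazy symbol binding). */
uint64_t average_ticks(void (*fn)(void), unsigned runs)
{
    assert(runs > 0);
    fn();
    uint64_t total = 0;
    for (unsigned r = 0; r < runs; ++r) {
        uint64_t t0 = tsc_begin();
        fn();
        total += tsc_end() - t0;
    }
    return total / runs;
}

/* Hypothetical workload used only to demonstrate the timer. */
static volatile double sink;
void sample_work(void)
{
    double acc = 0.0;
    for (int i = 1; i <= 1000; ++i)
        acc += 1.0 / i;
    sink = acc;
}
```

A call such as average_ticks(sample_work, 1000) gives a figure that is far less sensitive to first-call overhead than a single rdtsc pair around one execution.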

Bernard · 10-07-2014

>>>Finally, is it possible to customize the asm code produced by icl? For example:>>>

If I understood your question correctly, you can load the compiled exe file into the IDA Pro disassembler and make any assembly code changes; after that you can recompile the code. I am not sure whether this will work correctly.

Bernard · ‎10-07-2014

The trigo._xmm0[] array member should have a size divisible by 4 so that it can be loaded into xmm registers. I do not think that *_xmm0 = _mm_mul_ps(trigo._xmm2[1], trigo.coord); is a good way to use the compiler intrinsic.
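For comparison, here is a minimal sketch of how the first step of the yaw computation could be written so that _mm_mul_ps receives two __m128 operands. It follows the formula from the code comments, x' = x.cos(phi_y) * cos(phi_z) - y.cos(phi_y) * sin(phi_z) - z.sin(phi_y). The function name rotate_x and its parameters are hypothetical, and the sine and cosine values are assumed to be precomputed by the caller.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* x' = x*cy*cz - y*cy*sz - z*sy for coord = {x, y, z, w} (w is ignored).
   cy, sy, cz, sz are the precomputed cos/sin of phi_y and phi_z. */
float rotate_x(const float coord[4], float cy, float sy, float cz, float sz)
{
    assert(coord != 0);
    /* _mm_set_ps lists lanes from highest to lowest: (lane3, ..., lane0). */
    __m128 k = _mm_set_ps(0.0f, -sy, -cy * sz, cy * cz);
    __m128 p = _mm_mul_ps(_mm_loadu_ps(coord), k);  /* both operands are __m128 */

    float lanes[4];
    _mm_storeu_ps(lanes, p);
    return lanes[0] + lanes[1] + lanes[2];          /* horizontal sum of x, y, z terms */
}
```

The signs are folded into the coefficient vector, so a plain sum replaces the two horizontal subtractions; whether this is faster than the hsub version would have to be measured.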

Anonymous · ‎10-07-2014

trigo._xmm0, ._xmm1 and ._xmm2 each have 6 float members.

So I will not read outside the arrays; I control those arrays, don't worry. Provided, of course, that icl does not fragment the array members.

Anonymous · ‎10-07-2014

It will be difficult with IDA :/

Anonymous · ‎10-07-2014

As for testing the speed difference, I can't test fsincos through inline asm because, as I wrote, a single inline asm statement makes my frame rate drop by 170 fps (now) :D

The only way to know the truth is to contact an Intel engineer :/

Bernard · 10-07-2014

>>>So I will not read outside the arrays; I control those arrays, don't worry. Provided, of course, that icl does not fragment the array members.>>>

ICL will not fragment an array; by design, the compiler allocates a contiguous chunk of memory for an array.

Bernard · ‎10-07-2014

>>>It will be difficult with IDA :/>>>

Yes I know that, but for simple code patching it can be used.

Bernard · 10-10-2014

>>>As for testing the speed difference, I can't test fsincos through inline asm because, as I wrote, a single inline asm statement makes my frame rate drop by 170 fps (now) :D>>>

Does this mean that fsincos caused a 170 fps drop?

Anonymous · ‎10-10-2014

No, of course not. I mean that even if I add only one mov, the frame rate drops:

Ex:

              __asm      mov       eax, 0xdeadbeef

There are two possible reasons for this behaviour: either icl does not like inline asm, or I don't know how to configure the compiler options correctly.

Anonymous · ‎10-11-2014

Intel really must learn that this world contains not only hackers but also developers who want to program without obstacles.

I'm talking about the new program and data protection technologies: Intel SGX and Intel MPX. I'm scared of a future in which we will not be allowed to program without going through a lot of protocols, like 10,000 ^^

And of course the most powerful "anti-hack" technology, paging :D (I particularly hate this one)

Do you know how to access the first pixel of the screen in an easy way, like VESA? I would like to continue my OS-less 3D engine project in IA-32e mode. I don't know why the VBE core is discontinued :/

I have learned that we are kings of the world if we can access the pixels of the screen; the second parameter is how much time is needed to access them.

Does Intel publish a calendar of new ISA extensions? I tried to search for it.