Solved: 3D engine - Page 9

Anonymous · ‎09-02-2014

Hello there,

OK, here we go. I have a dream: to build a 3D engine written 100% in Intel assembly, running on the CPU only. For now I use only rotation matrices.

It works, of course, but it gets slow when I draw a lot of pixels.

Recently I decided to add voxels to my engine, and it becomes slow once I have >= 8000 voxels (a 20 * 20 * 20 cube). When I saw NVIDIA display 32M voxels (fire), I wondered how they do it!

I have a rough idea of the reason: MMU, paging, segmentation, memory.

Am I right?

Another question: is the FPU slower than SSE for floating-point computation, or does it depend on the data being manipulated?

PS: I work without an OS such as Windows or Linux; I run on my own kernel and bootloader, also written in assembly with NASM.

Sorry if my English isn't good; I'm French and I use Google Translate ^-^

Bradley_W_Intel · ‎09-02-2014

You clearly are using the processor in a very advanced way. I will do my best to answer your questions:

1) Why is your voxel engine not able to efficiently render as many voxels as you'd like? Voxel engines need to maximize their use of parallelism (both threading and SIMD) and also to store the data efficiently in an octree or some other structure that can handle sparse data. If you are doing all these things and still not getting the performance you expect, it's an optimization problem. Some Intel tools like VTune Performance Analyzer are excellent for performance analysis.

2) Is single data floating point math faster than SIMD (if I understood you)? Typically SIMD will be faster than single data instructions if your data is laid out in a way that supports the SIMD calls. In all cases, the only way for you to know for certain which way is faster is to test it.

3) How can you select between discrete and processor graphics? DirectX has methods of enumerating adapters. In such a case, the processor graphics is listed separately from the discrete graphics. If you are choosing your adapter based on the amount of available memory, you may be favoring the processor graphics when you didn't intend to. Intel has sample code that shows how to properly detect adapters in DirectX at https://software.intel.com/en-us/vcsource/samples/gpu-detect. The process for OpenGL is not well documented.

4) Can I use one processor to control execution of a second processor? Probably not. The details on Intel processors are covered at http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html. It's possible, though unlikely, that you'll be able to find something in there that can help you.
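
As a rough illustration of point 2, here is a minimal sketch in C of a data layout that suits SIMD: the coordinates are stored as a structure of arrays, so the same operation runs over contiguous floats. The names `PointsSoA`, `BATCH` and `rotate_z_batch` are made up for this example, and the cosine and sine are assumed to be precomputed by the caller. Whether a given compiler actually vectorizes the loop has to be checked in its report or in the generated assembly.

```c
#include <stddef.h>

#define BATCH 8  /* hypothetical batch size, chosen as a multiple of the SIMD width */

/* Structure of arrays: all x values are contiguous, then all y, then all z. */
typedef struct {
    float x[BATCH];
    float y[BATCH];
    float z[BATCH];
} PointsSoA;

/* Rotate every point about the z axis.
   c and s are the precomputed cosine and sine of the angle.
   Each iteration is independent of the others, which is the
   property a vectorizer (or hand-written SIMD) relies on. */
static void rotate_z_batch(PointsSoA *p, float c, float s)
{
    for (size_t i = 0; i < BATCH; ++i) {
        const float x = p->x[i];
        const float y = p->y[i];
        p->x[i] = x * c - y * s;
        p->y[i] = x * s + y * c;
        /* z is unchanged by a rotation about the z axis */
    }
}
```

With an array-of-structures layout (x, y, z, color per point) the same loop would need shuffles or horizontal operations to line the lanes up, which is one reason the layout matters as much as the instruction set.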

Anonymous · ‎12-10-2014

As for disassembling BIOS routines, there is a small problem. For many BIOS routines I know where they are (their address), so I can disassemble them, but for int 0x10 I wasn't able to reach the routine with a call to the address stored in the IVT (interrupt vector table).

So I don't know where int 0x10 jumps to in order to execute this BIOS routine :/

For example:

		; xor		ah, ah
		; pusha
		; call		0xF000:0xE82E		; --> int 0x16
		; popa
		
		; pusha
		; call		0xF000:0x5EE9		; --> int 0x13
		; popa				
		
		; mov		ax, 0x4F02 
		; mov		bx, 0x0115 
		; pusha
		; call		0x0C00:0014		; --> int	0x10
		; popa

Calling int 0x16 and int 0x13 this way works, but calling int 0x10 doesn't. That's a BIG problem ^^

I get these addresses with this program I wrote (octet = byte):

; Before tackling this source, you should know that the IVT is 1 KB in size, i.e. 1024 bytes
; *0x00000-0x003FF IVT (interrupt vector table)
; You should also know that the IVT holds memory addresses that point to routines:
; when you use 'int' in asm, the CPU jumps to a memory address stored in
; this IVT. To find the table entry of an interrupt, compute (interrupt_number*4).
; For example, for int 10h, its entry in the table is at 10h*4 = 40h.
; Each entry is 4 bytes long, so the entry for int 10h occupies 40h to 43h.

; To look up asm instructions if you forget them, go to: http://www.gladir.com/LEXIQUE/ASM/DICTIONN.HTM

; Created: 06/02/12 at 21:11:25


%define 	nbr_sector_IVT  0x2  		

;=============================================================================================================
;CODE		CODE		CODE		CODE		CODE		CODE		CODE		CODE		CODE	  CODE	
;=============================================================================================================
	main_kernel:	
			
		mov 	byte [bootdev], dl 	
		
		; Set the TF (trap flag) to 1
		pushf
		pop     ax
		or      ax,0000000100000000b   ; Set TF to 1
		push    ax
		popf
	
		; [es:di] --> 0x1000+(tabl_IVT-0x1000)
		mov		ax, 0x100
		mov		es, ax
		mov		edi, tabl_IVT-0x1000
		
		; [fs:si] --> 0x00000000	; Pointer to the IVT
		xor		ax, ax
		mov		fs, ax
		xor		si, si
		mov		ecx, 256
		 
		; Quote:
		; When an interrupt occurs, the interrupt number
		; is used to locate the instructions to execute.
		; The interrupt number is multiplied by 4 to find
		; the address of the CS and IP to fetch (IP is at
		; the lower address, followed by CS). Then a JMP to
		; CS:IP is performed. CS:IP is a "vector" to the
		; code to execute in response to the interrupt.
		; - Example: if the contents of memory, starting at
		; address 00000h, are 00h, 01h, 02h, 03h, 04h, 05h,
		; 06h, 07h, 08h, etc. and interrupt 1 occurs,
		; then the routine at address 0706:0504 will be
		; executed.
		IVT:
			mov		eax, [fs:si]
			add		si, 4
			
			; Here I swap the 2 bytes of the low half
			; and of the high half
			mov		bh, al	; al --> 4th byte of tabl_IVT
			mov		bl, ah	; ah --> 3rd byte	"
			shr		eax, 16 ; Trick: move the 1st and 2nd bytes into the low half of eax
							; so that they can be addressed, which is impossible in the
							; high half
			mov		dh, al  ; al --> 2nd byte of tabl_IVT
			mov		dl, ah	; ah --> 1st byte	"
			
			; Next, I swap the high half and the low half of eax
			mov		ax, bx 
			shl		eax, 16
			mov		ax, dx
			
			stosd		; mov [es:edi], eax  Save the 4 bytes of eax into tabl_IVT
			
		loop	IVT	; 256 iterations, each saving 4 bytes from the IVT

	 ;=============
	 ; writesector
	 ; Write sectors
	 ; Input:
	 ;			AX = logical sector number (0..2879)
	 ;			ES:BX = buffer holding the data to write
	 ;			DL = disk unit
	 ; Output: none
	 ;=============
	 writesector:

			;------------------- Write the sectors in CHS (Cylinder Head Sector) mode, where the kernel is located

			 mov	ax, 0x100
			 mov	es, ax
			 mov	bx, tabl_IVT-0x1000
			 mov 	ah, 3				; Function 03h of int 13h (write sectors)
			 mov 	al, nbr_sector_IVT		; nbr_sector_IVT sectors to write (1 sector = 512 bytes)
			 xor 	ch, ch				; Cylinder=0
			 mov 	cl, 3				; Sector=3
			 xor 	dh, dh				; Head=0
			 mov 	dl, [bootdev]		; Drive
			int 	0x13	

	 ;=============
	 ;/writesector
	 ;=============
	jmp		0xF000:0xE05B	; Reboot
;=======================================================================================================
;END_CODE		END_CODE		END_CODE		END_CODE		END_CODE		END_CODE		END_CODE
;=======================================================================================================

;================================================================================================
;DATA		DATA		DATA		DATA		DATA		DATA		DATA		DATA		
;================================================================================================ 

	
	tabl_IVT:
		times 	1024	db 		0

;=======================================================================================================
;END_DATA		END_DATA		END_DATA		END_DATA		END_DATA		END_DATA		END_DATA		
;=======================================================================================================
	
	times 	512  db 	0  

;*Hacker's emblem:				*Art Ascii:																		
;	|_|0|_|						;	(\___/)			 									
;	|_|_|0|						;	(='.'=)											   												   
;	|0|0|0|						;	(")_(")																						
								;This is Bunny. Copy and paste bunny into...																									 
								;...your signature to help him gain world domination
;----------------------------------------------------------------------------------------------------------	
;|	 ____											 O_O        	  I\_/I			(°_°) 		 ,,,,	  |																			
;|	[@__@]		(o_o)     (°_°)    (^_^)			(^_^)       	  (^_^)			 /T\		(°v°)     |
;|-------------------------------------------___  °(_   _)°*    *	°(_   _)°*       / \	   /(   )\	  |
;												--------------------------------------------___	 Y Y	  |										 
;																								----------|

;	   /\_/\
;	  / 0 0 \
;	 ====v====
;	  \  W  /
;	  |     |     _
;	  / ___ \    /
;	 / /   \ \  |
;	(((-----)))-'
;	  /
;	 (      ___
;	  \__.=|___E
;		   /

; I hope this has been a valuable help in your research :)
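
The IVT lookup described in the comments of the listing above can be sketched in C as follows. This is only an illustration working on a copy of the table in an ordinary byte buffer; `IvtEntry`, `ivt_entry` and `linear_address` are hypothetical names, not part of the program above.

```c
#include <stddef.h>
#include <stdint.h>

/* One real-mode interrupt vector: 16-bit offset, then 16-bit segment. */
typedef struct {
    uint16_t offset;   /* IP, stored at the lower address */
    uint16_t segment;  /* CS, stored right after it */
} IvtEntry;

/* Decode vector n from a copy of the IVT.
   ivt must point to at least 1024 bytes (256 entries of 4 bytes). */
static IvtEntry ivt_entry(const uint8_t *ivt, uint8_t n)
{
    const uint8_t *p = ivt + (size_t)n * 4;  /* entry address = n * 4 */
    IvtEntry e;
    e.offset  = (uint16_t)(p[0] | (p[1] << 8));  /* little-endian words */
    e.segment = (uint16_t)(p[2] | (p[3] << 8));
    return e;
}

/* Real-mode linear address of segment:offset. */
static uint32_t linear_address(IvtEntry e)
{
    return ((uint32_t)e.segment << 4) + e.offset;
}
```

With memory holding 00h, 01h, 02h, ... from address 0, vector 1 decodes to 0706:0504, as in the quoted example, and the entry for int 10h is read from bytes 40h to 43h.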

Bernard · ‎12-10-2014

Hi shaynox,

It is nice to see you back on IDZ.

I uploaded a simple wrapper class for SIMD instructions. I am using it to develop a renderer.

Bernard · ‎12-10-2014

shaynox s. wrote:

I have a question: why are AVX registers slower than SSE registers?

I have an i7-2640M CPU, and when I try to access AVX registers in my program my frame rate falls to 5 fps, whereas when I keep SSE registers it is higher: 27 fps.

Here's a sample:
		VBROADCASTF128		ymm2, [rotate_yz_ymm2]
		VBROADCASTF128		ymm3, [rotate_xyz_ymm3]
		VBROADCASTF128		ymm4, [rotate_yz_ymm4]
		VBROADCASTF128		ymm5, [rotate_z_ymm5]
		VBROADCASTF128		ymm6, [rotate_y_ymm6]
		VBROADCASTF128		ymm7, [coordonee]
is slower than:
		vmovups		xmm2, [rotate_yz_ymm2]
		vmovups		xmm3, [rotate_xyz_ymm3]
		vmovups		xmm4, [rotate_yz_ymm4]
		vmovups		xmm5, [rotate_z_ymm5]
		vmovups		xmm6, [rotate_y_ymm6]
		vmovups		xmm7, [coordonee]
strange :/

Does your code intermix SSE and AVX instructions in the same code path? If it does, then you are paying SSE-to-AVX transition penalties. To solve this, use the vzeroupper instruction.

Anonymous · ‎12-10-2014

Yes, I mix SSE and AVX instructions, but since xmm is mapped into ymm, I thought there wouldn't be any compatibility problem.

vzeroupper doesn't work, but OK, I will try using only AVX registers.

Anonymous · ‎12-10-2014

Nice of you to upload your header, but I don't code in or know C++ ^^

Anonymous · ‎12-10-2014

Wooow, it works, thanks!!!

I've just upgraded all SSE instructions to the v- prefix and there's no more latency, so I'm happy, but I get 6-7 fps less than with SSE instructions :/

So I have 21 fps with the AVX instructions (v- prefix) and 27 fps with the SSE instructions.

In that case I'm seriously considering writing my 3D renderer in C with the Intel compiler only and abandoning NASM/asm :/

I thought writing asm would bring me performance, because I can use all the instructions whereas a compiler uses a restricted set (even icl, I guess), and because, being close to the CPU's mechanics, I can do whatever I want ^^

I program in assembly just for optimization; I love it, even for one small source file.

I'm really disappointed ^^ With Visual Studio + icl I get 48 fps at 1920*1080, and with Notepad++ + NASM I get 16 fps rendering one 100*100*100 cube on screen using the latest technology of my processor.

...

About the compiler's vectorization help: my friend uses the union keyword (he uses gcc to build his own 3D engine) and it works perfectly. When he sent me the asm file generated by gcc, I saw vmulps ymm., ymm. .

I don't get how he can use AVX registers whereas icl doesn't use them even though it uses AVX instructions. Is it a problem with compiler options?

Bernard · ‎12-11-2014

shaynox s. wrote:

Yes, I mix SSE and AVX instructions, but since xmm is mapped into ymm, I thought there wouldn't be any compatibility problem.

vzeroupper doesn't work, but OK, I will try using only AVX registers.

Do not mix those different instruction types. Use either AVX or SSE.

https://software.intel.com/en-us/articles/avoiding-avx-sse-transition-penalties

Bernard · ‎12-11-2014

>>>Nice of you to upload your header, but I don't code in or know C++ ^^>>>

Tomorrow I will upload a C version of the lightweight SIMD library.

Bernard · ‎12-11-2014

shaynox s. wrote:

Wooow, it works, thanks!!!

I've just upgraded all SSE instructions to the v- prefix and there's no more latency, so I'm happy, but I get 6-7 fps less than with SSE instructions :/

So I have 21 fps with the AVX instructions (v- prefix) and 27 fps with the SSE instructions.

In that case I'm seriously considering writing my 3D renderer in C with the Intel compiler only and abandoning NASM/asm :/

I thought writing asm would bring me performance, because I can use all the instructions whereas a compiler uses a restricted set (even icl, I guess), and because, being close to the CPU's mechanics, I can do whatever I want ^^

I program in assembly just for optimization; I love it, even for one small source file.

I'm really disappointed ^^ With Visual Studio + icl I get 48 fps at 1920*1080, and with Notepad++ + NASM I get 16 fps rendering one 100*100*100 cube on screen using the latest technology of my processor.

...

About the compiler's vectorization help: my friend uses the union keyword (he uses gcc to build his own 3D engine) and it works perfectly. When he sent me the asm file generated by gcc, I saw vmulps ymm., ymm. .

I don't get how he can use AVX registers whereas icl doesn't use them even though it uses AVX instructions. Is it a problem with compiler options?

To investigate different coding strategies (AVX vs SSE) effectively, you should use the VTune profiler.

You can write your own 3D engine on Windows with the MASM or NASM assemblers, but it will not be an easy task. Think about writing a renderer in asm that is based on DirectX.

You will not always be able to beat an optimizing compiler. I advise writing some code in asm (NASM syntax), writing the same code in C with compiler optimization enabled, and then comparing both results. A compiler will usually apply various transformation tricks to speed up code execution.

Please look at this book: http://www.amazon.com/Optimizing-Compilers-Modern-Architectures-Dependence-based/dp/1558602860
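
The asm-versus-C comparison suggested above needs some way to time both versions under the same conditions. A minimal sketch in C, assuming both kernels share one signature; `kernel_fn`, `scale_c` and `time_kernel` are illustrative names, and `clock()` is only a coarse timer, so a profiler such as VTune remains the better tool for fine-grained analysis.

```c
#include <stddef.h>
#include <time.h>

/* Common signature for the kernels being compared. */
typedef void (*kernel_fn)(float *data, size_t n);

/* Reference kernel in plain C. A hand-written asm routine exported
   with the same signature (hypothetical) could be timed the same way. */
static void scale_c(float *data, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        data[i] *= 0.5f;
}

/* Run fn reps times and return the elapsed CPU time in seconds. */
static double time_kernel(kernel_fn fn, float *data, size_t n, int reps)
{
    const clock_t start = clock();
    for (int r = 0; r < reps; ++r)
        fn(data, n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling `time_kernel` once per implementation on the same buffer size, with enough repetitions for the total to be well above the timer resolution, gives numbers that can be compared directly.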

Anonymous · ‎12-12-2014

I have reached 80 fps (with additional programs running (VLC, Mozilla, Skype, etc.) and 100 fps without them) just by modifying compiler options (60 before), and the compiler still didn't vectorize the matrix rotation.

Or rather, I don't know whether it vectorizes, because I can't get the asm file with the /Qipo option active (interprocedural optimization; warning #10199: IPO enabled; /Fa and /FA produce dummy .asm files). I get those 80/100 fps with this option and 60 fps without it, so maybe it vectorizes but I can't see it :D

I have uploaded the project; it now uses WinAPI and no longer SDL.

Usage: press the Escape key to quit the program.

(

/MP /GS- /TC /GA /W3 /Zc:wchar_t /Zi /O2 /Qopt-report:1 /Qopt-report-phase:vec /Fd"x64\Release\vc120.pdb" /fp:fast /Qipo /GF /Zc:forScope /arch:AVX /MT /Fa"x64\Release\" /nologo /Qparallel /Fo"x64\Release\" /C /Qprof-dir "x64\Release\" /FAs /Ot /Fp"x64\Release\HackEngine.pch"

)

Bernard · ‎12-13-2014

>>>Or rather, I don't know whether it vectorizes, because I can't get the asm file with the /Qipo option active (interprocedural optimization; warning #10199: IPO enabled; /Fa and /FA produce dummy .asm files). I get those 80/100 fps with this option and 60 fps without it, so maybe it vectorizes but I can't see it :D>>>

Can you post the assembly code for that matrix? If the code was vectorized, you should usually see unrolling and vector load instructions: vmovaps instead of vmovss.

Anonymous · ‎12-13-2014

;;; 	trigo.end_coord[_x] = trigo.coord[_x] * ((trigo.cos[_y] * trigo.cos[_z])) - trigo.coord[_y] * ((trigo.cos[_y] * trigo.sin[_z])) - trigo.coord[_z] * (trigo.sin[_y]);

        vmulss    xmm0, xmm1, DWORD PTR [r13]                   ;158.2
$LN407:
        vmulss    xmm15, xmm2, DWORD PTR [4+r13]                ;158.2
$LN408:
        vmulss    xmm4, xmm14, DWORD PTR [8+r13]                ;158.2
$LN409:
        vsubss    xmm0, xmm0, xmm15                             ;158.2
$LN410:

;;; 	trigo.end_coord[_y] = trigo.coord[_x] * ((trigo.cos[_x] * trigo.sin[_z]) - (trigo.sin[_x] * trigo.sin[_y] * trigo.cos[_z])) + trigo.coord[_y] * ((trigo.sin[_x] * trigo.sin[_y] * trigo.sin[_z]) + (trigo.cos[_x] * trigo.cos[_z])) - trigo.coord[_z] * (trigo.sin[_x] * trigo.cos[_y]);

        vmulss    xmm15, xmm13, xmm2                            ;159.2
$LN411:
        vmulss    xmm3, xmm0, xmm12                             ;158.2
$LN412:
        vxorps    xmm5, xmm4, XMMWORD PTR [_2il0floatpacket.28] ;158.2
$LN413:
        vaddss    xmm0, xmm3, xmm5                              ;158.2
$LN414:
        vmulss    xmm5, xmm11, xmm14                            ;159.2
$LN415:
        vmulss    xmm4, xmm5, xmm1                              ;159.2
$LN416:
        vmulss    xmm5, xmm2, xmm5                              ;159.2
$LN417:
        vsubss    xmm3, xmm15, xmm4                             ;159.2
$LN418:
        vmulss    xmm15, xmm13, xmm1                            ;159.2
$LN419:
        vmovss    DWORD PTR [trigo+56], xmm0                    ;158.2
$LN420:
        vmulss    xmm4, xmm3, DWORD PTR [r13]                   ;159.2
$LN421:
        vaddss    xmm3, xmm5, xmm15                             ;159.2
$LN422:
        vmulss    xmm5, xmm3, DWORD PTR [4+r13]                 ;159.2
$LN423:

;;; 	trigo.end_coord[_z] = trigo.coord[_x] * ((trigo.cos[_x] * trigo.sin[_y] * trigo.cos[_z]) + (trigo.sin[_x] * trigo.sin[_z])) + trigo.coord[_y] * ((trigo.sin[_x] * trigo.cos[_z]) - (trigo.cos[_x] * trigo.sin[_y] * trigo.sin[_z])) + trigo.coord[_z] * (trigo.cos[_x] * trigo.cos[_y]);

        vmulss    xmm3, xmm14, xmm13                            ;160.2
$LN424:
        vaddss    xmm15, xmm4, xmm5                             ;159.2
$LN425:
        vmulss    xmm4, xmm11, xmm12                            ;159.2
$LN426:
        vmulss    xmm14, xmm3, xmm1                             ;160.2
$LN427:
        vmulss    xmm5, xmm11, xmm2                             ;160.2
$LN428:
        vmulss    xmm4, xmm4, DWORD PTR [8+r13]                 ;159.2
$LN429:
        vmulss    xmm11, xmm11, xmm1                            ;160.2
$LN430:
        vmulss    xmm1, xmm2, xmm3                              ;160.2
$LN431:
        vmulss    xmm12, xmm13, xmm12                           ;160.2
$LN432:
        vaddss    xmm14, xmm14, xmm5                            ;160.2
$LN433:
        vsubss    xmm4, xmm15, xmm4                             ;159.2
$LN434:
        vsubss    xmm2, xmm11, xmm1                             ;160.2
$LN435:
        vmovss    DWORD PTR [trigo+60], xmm4                    ;159.2
$LN436:
        vmulss    xmm14, xmm14, DWORD PTR [r13]                 ;160.2
$LN437:
        vmulss    xmm3, xmm2, DWORD PTR [4+r13]                 ;160.2
$LN438:
        vmulss    xmm13, xmm12, DWORD PTR [8+r13]               ;160.2
$LN439:
        vaddss    xmm5, xmm14, xmm3                             ;160.2
$LN440:
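
The three C source lines quoted in the listing above each compute one row of a 3x3 rotation matrix applied to the point. The nine coefficients depend only on the angles, so one possible restructuring is to build the matrix once and then apply it to every point. The sketch below assumes the cosines and sines are already available in small tables, as with `trigo.cos[]` and `trigo.sin[]` in the listing; `Rot3`, `make_rotation`, `apply_rotation` and the `AX_*` indices are hypothetical names standing in for the original ones.

```c
enum { AX_X = 0, AX_Y = 1, AX_Z = 2 };  /* stand-ins for _x, _y, _z */

typedef struct { float m[3][3]; } Rot3;

/* Build the matrix from precomputed cosines c[] and sines s[],
   term by term from the three quoted source lines. */
static Rot3 make_rotation(const float c[3], const float s[3])
{
    Rot3 r;
    r.m[0][0] =  c[AX_Y] * c[AX_Z];
    r.m[0][1] = -(c[AX_Y] * s[AX_Z]);
    r.m[0][2] = -s[AX_Y];
    r.m[1][0] =  c[AX_X] * s[AX_Z] - s[AX_X] * s[AX_Y] * c[AX_Z];
    r.m[1][1] =  s[AX_X] * s[AX_Y] * s[AX_Z] + c[AX_X] * c[AX_Z];
    r.m[1][2] = -(s[AX_X] * c[AX_Y]);
    r.m[2][0] =  c[AX_X] * s[AX_Y] * c[AX_Z] + s[AX_X] * s[AX_Z];
    r.m[2][1] =  s[AX_X] * c[AX_Z] - c[AX_X] * s[AX_Y] * s[AX_Z];
    r.m[2][2] =  c[AX_X] * c[AX_Y];
    return r;
}

/* Apply the matrix to one point: 9 multiplies and 6 adds. */
static void apply_rotation(const Rot3 *r, const float in[3], float out[3])
{
    for (int i = 0; i < 3; ++i)
        out[i] = r->m[i][0] * in[0] + r->m[i][1] * in[1] + r->m[i][2] * in[2];
}
```

With all angles at zero the matrix is the identity, which makes a quick sanity check. Whether this form is actually faster than the compiler output above has to be measured.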

And here's my new calculation:

;====================================================================================================
;FUNCTIONS		FUNCTIONS		  FUNCTIONS	    	FUNCTIONS	    	FUNCTIONS	
;====================================================================================================			
; make_rotations:
; Apply the rotation to the point	|[rsi + 0] = x
;									|[rsi + 4] = y
;									|[rsi + 8] = z
;									|[rsi + 12] = color		.Unmodified

			vmovups		ymm2, [rotate_yz_ymm2]
			vmovups		ymm3, [rotate_xyz_ymm3]
			vmovups		ymm4, [rotate_yz_ymm4]
			vmovups		ymm5, [rotate_z_ymm5]
			vmovups		ymm6, [rotate_y_ymm6]
			vmovups		ymm7, [coordonee]
		;=============
		; yaw
		;=============
			Yaw:	; y
				;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
				;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
				;; We compute x = x.(cos(phi_y) * cos(phi_z)) -        ;;
				;;				  y.(cos(phi_y) * sin(phi_z)) -        ;;
				;;				  z.(sin(phi_y))                       ;;
				;;                                                     ;;
				;;	; ymm0[8] =     ymm0[8]  vmulps   ymm3[8]          ;;
				;;	; 		      cos(phi_y)   *    cos(phi_z)         ;;
				;;	; 		      cos(phi_y)   *    sin(phi_z)         ;;
				;;	;		      sin(phi_y)   *        1              ;;
				;;	;		  	      0 	   *        0              ;;
				;;	; 		      cos(phi_y)   *    cos(phi_z)         ;;
				;;	; 		      cos(phi_y)   *    sin(phi_z)         ;;
				;;	;		      sin(phi_y)   *        1              ;;
				;;	;		  	      0 	   *        0              ;;
				;;                             |                       ;;
				;;	; ymm0[8] =     ymm0[8]  vmulps   ymm7[8]          ;;
				;;	; 		 	    ymm0[0]    *        x              ;;
				;;	; 		 	    ymm0[1]    *        y              ;;
				;;	; 		 	    ymm0[2]    *        z              ;;
				;;	; 		 	    ymm0[3]	   *      color            ;;
				;;	; 		 	    ymm0[4]    *        x              ;;
				;;	; 		 	    ymm0[5]    *        y              ;;
				;;	; 		 	    ymm0[6]    *        z              ;;
				;;	; 		 	    ymm0[7]	   *      color            ;;
				;;                             |                       ;;
				;;	; ymm0[8] =     ymm0[8]  vhsubps   ymm0[8]         ;;
				;;	; 		  	    ymm0[0]     -      ymm0[1]         ;;
				;;	; 		  	    ymm0[2]     -      ymm0[3]         ;;
				;;	; 		  	    ymm0[0]     -      ymm0[1]         ;;
				;;	; 		  	    ymm0[2]     -      ymm0[3]         ;;
				;;	; 		  	    ymm0[4]     -      ymm0[5]         ;;
				;;	; 		  	    ymm0[6]     -      ymm0[7]         ;;
				;;	; 		  	    ymm0[4]     -      ymm0[5]         ;;
				;;	; 		  	    ymm0[6]     -      ymm0[7]         ;;
				;;                              |                      ;;
				;;	; ymm0[8] =     ymm0[8]  vhsubps   ymm0[8]         ;;
				;;	; 		  	    ymm0[0]     -      ymm0[1]         ;;
				;;	; 		  	    ymm0[2]     -      ymm0[3]         ;;
				;;	; 		  	    ymm0[0]     -      ymm0[1]         ;;
				;;	; 		  	    ymm0[2]     -      ymm0[3]         ;;
				;;	; 		  	    ymm0[4]     -      ymm0[5]         ;;
				;;	; 		  	    ymm0[6]     -      ymm0[7]         ;;
				;;	; 		  	    ymm0[4]     -      ymm0[5]         ;;
				;;	; 		  	    ymm0[6]     -      ymm0[7]         ;;
				;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
				;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
				vmovups		ymm0, [rotate_x_ymm0]
				
				vmulps		ymm0, ymm3			; ymm3 = rotate_xyz_ymm3
				vmulps		ymm0, ymm7			; ymm7 = coordonee
				
				vhsubps		ymm0, ymm0
				vhsubps		ymm0, ymm0

				; Store the new coordinate
					vmovss		[coordonee_after + _y], xmm0	; VEX-encoded, so no legacy SSE is mixed with the AVX code
					
						vmovups		[temp], ymm0
						vmovss		xmm0, [temp + _color + 4]
					vmovss		[coordonee_after + 12], xmm0
		;=============
		; / yaw
		;=============	
	
		;=============
		; pitch
		;=============
			Pitch:	; x
				;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
				;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
				;; We compute y = x.((cos(phi_x) * sin(phi_z)) - (sin(phi_x) * cos(phi_z) * sin(phi_y))) +      ;;
				;;				  y.((cos(phi_x) * cos(phi_z)) + (sin(phi_x) * sin(phi_z) * sin(phi_y))) -      ;;
				;;				  z.( sin(phi_x) * cos(phi_y))                                                  ;;
				;;                                                                                              ;;
				;;	; ymm0[8] =     ymm0[8]  vmulps   ymm2[8]                                                   ;;
				;;	; 		      cos(phi_x)   *    sin(phi_z)                                                  ;;
				;;	; 		      cos(phi_x)   *    cos(phi_z)                                                  ;;
				;;	; 		      sin(phi_x)   *    cos(phi_y)                                                  ;;
				;;	;		  	      0 	   *        0                                                       ;;
				;;	; 		      cos(phi_x)   *    sin(phi_z)                                                  ;;
				;;	; 		      cos(phi_x)   *    cos(phi_z)                                                  ;;
				;;	; 		      sin(phi_x)   *    cos(phi_y)                                                  ;;
				;;	;		  	      0	       *        0                                                       ;;
				;;                             |                                                                ;;
				;;	; ymm1[8] =     ymm1[8]  vmulps   ymm3[8]                                                   ;;
				;;	; 		      sin(phi_x)   *    cos(phi_z)                                                  ;;
				;;	; 		      sin(phi_x)   *    sin(phi_z)                                                  ;;
				;;	;		  	      0 	   * 	    1                                                       ;;
				;;	;		  	      0 	   *        0                                                       ;;
				;;	; 		      sin(phi_x)   *    cos(phi_z)                                                  ;;
				;;	; 		      sin(phi_x)   *    sin(phi_z)                                                  ;;
				;;	;		  	      0 	   *        1                                                       ;;
				;;	;		  	      0 	   *        0                                                       ;;
				;;                             |                                                                ;;
				;;	; ymm1[8] =     ymm1[8]  vmulps   ymm4[8]                                                   ;;
				;;	; 		  	    ymm1[0]    *    sin(phi_y)                                                  ;;
				;;	; 		  	    ymm1[1]    *    sin(phi_y)                                                  ;;
				;;	; 		  	    ymm1[2]    *        0                                                       ;;
				;;	; 		  	    ymm1[3]    *        0                                                       ;;
				;;	; 		  	    ymm1[4]    *    sin(phi_y)                                                  ;;
				;;	; 		  	    ymm1[5]    *    sin(phi_y)                                                  ;;
				;;	; 		  	    ymm1[6]    *        0                                                       ;;
				;;	; 		  	    ymm1[7]    *        0                                                       ;;
				;;                             |                                                                ;;
				;;	; ymm0[8] =     ymm0[8] vaddsubps ymm1[8]                                                   ;;
				;;	; 		  	    ymm0[0]     -     ymm1[0]                                                   ;;
				;;	; 		  	    ymm0[1]     +     ymm1[1]                                                   ;;
				;;	; 		  	    ymm0[2]     -     ymm1[2]                                                   ;;
				;;	; 		  	    ymm0[3]     +     ymm1[3]                                                   ;;
				;;	; 		  	    ymm0[4]     -     ymm1[4]                                                   ;;
				;;	; 		  	    ymm0[5]     +     ymm1[5]                                                   ;;
				;;	; 		  	    ymm0[6]     -     ymm1[6]                                                   ;;
				;;	; 		  	    ymm0[7]     +     ymm1[7]                                                   ;;
				;;                             |                                                                ;;
				;;	; ymm0[8] =     ymm0[8]  vmulps   ymm7[8]                                                   ;;
				;;	; 		 	    ymm0[0]    *        x                                                       ;;
				;;	; 		 	    ymm0[1]    *        y                                                       ;;
				;;	; 		 	    ymm0[2]    *        z                                                       ;;
				;;	; 		 	    ymm0[3]	   *      color                                                     ;;
				;;	; 		 	    ymm0[4]    *        x                                                       ;;
				;;	; 		 	    ymm0[5]    *        y                                                       ;;
				;;	; 		 	    ymm0[6]    *        z                                                       ;;
				;;	; 		 	    ymm0[7]	   *      color                                                     ;;
				;;                             |                                                                ;;
				;;	; ymm0[8] =     ymm0[8]  vmulps  ymm6[8]                                                    ;;
				;;	; 		  	    ymm0[0]    *       1                                                        ;;
				;;	; 		  	    ymm0[1]    *      -1                                                        ;;
				;;	; 		  	    ymm0[2]    *       1                                                        ;;
				;;	; 		  	    ymm0[3]    *       0                                                        ;;
				;;	; 		  	    ymm0[4]    *       1                                                        ;;
				;;	; 		  	    ymm0[5]    *      -1                                                        ;;
				;;	; 		  	    ymm0[6]    *       1                                                        ;;
				;;	; 		  	    ymm0[7]    *       0                                                        ;;
				;;                             |                                                                ;;
				;;	; ymm0[8] =     ymm0[8]  vhsubps   ymm0[8]                                                  ;;
				;;	; 		  	    ymm0[0]     -      ymm0[1]                                                  ;;
				;;	; 		  	    ymm0[2]     -      ymm0[3]                                                  ;;
				;;	; 		  	    ymm0[0]     -      ymm0[1]                                                  ;;
				;;	; 		  	    ymm0[2]     -      ymm0[3]                                                  ;;
				;;	; 		  	    ymm0[4]     -      ymm0[5]                                                  ;;
				;;	; 		  	    ymm0[6]     -      ymm0[7]                                                  ;;
				;;	; 		  	    ymm0[4]     -      ymm0[5]                                                  ;;
				;;	; 		  	    ymm0[6]     -      ymm0[7]                                                  ;;
				;;                              |                                                               ;;
				;;	; ymm0[8] =     ymm0[8]  vhsubps   ymm0[8]                                                  ;;
				;;	; 		  	    ymm0[0]     -      ymm0[1]                                                  ;;
				;;	; 		  	    ymm0[2]     -      ymm0[3]                                                  ;;
				;;	; 		  	    ymm0[0]     -      ymm0[1]                                                  ;;
				;;	; 		  	    ymm0[2]     -      ymm0[3]                                                  ;;
				;;	; 		  	    ymm0[4]     -      ymm0[5]                                                  ;;
				;;	; 		  	    ymm0[6]     -      ymm0[7]                                                  ;;
				;;	; 		  	    ymm0[4]     -      ymm0[5]                                                  ;;
				;;	; 		  	    ymm0[6]     -      ymm0[7]                                                  ;;
				;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
				;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
				vmovups		ymm0, [rotate_y_ymm0]
				vmovups		ymm1, [rotate_y_ymm1]
				
				vmulps		ymm0, ymm2			; ymm2 = rotate_yz_ymm2
				vmulps		ymm1, ymm3			; ymm3 = rotate_xyz_ymm3
				vmulps		ymm1, ymm4			; ymm4 = rotate_yz_ymm4
				
				vaddsubps	ymm0, ymm1

				vmulps		ymm0, ymm7			; ymm7 = coordonee
				vmulps		ymm0, ymm6			; ymm6 = rotate_y_ymm6
				
				vhsubps		ymm0, ymm0
				vhsubps		ymm0, ymm0

				; Store the new coordinate: x
				movss		[coordonee_after + _x], xmm0
				
					vmovups		[temp], ymm0
					movss		xmm0, [temp + _color + 4]
				movss		[coordonee_after + 8], xmm0
		;=============
		; / pitch
		;=============
		;=============
		; roll
		;=============
			Roll:	; z	
				;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
				;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
				;; Compute z    = x.((sin(phi_x) * sin(phi_z)) + (cos(phi_x) * cos(phi_z) * sin(phi_y))) +      ;;
				;;				  y.((sin(phi_x) * cos(phi_z)) - (cos(phi_x) * sin(phi_z) * sin(phi_y))) +      ;;
				;;				  z.( cos(phi_x) * cos(phi_y))                                                  ;;
				;;                                                                                              ;;
				;;	; ymm0[8] =     ymm0[8]  vmulps   ymm2[8]                                                   ;;
				;;	; 		      sin(phi_x)   *    sin(phi_z)                                                  ;;
				;;	; 		      sin(phi_x)   *    cos(phi_z)                                                  ;;
				;;	; 		      cos(phi_x)   *    cos(phi_y)                                                  ;;
				;;	;		  	      0 	   * 	    0                                                       ;;
				;;	; 		      sin(phi_x)   *    sin(phi_z)                                                  ;;
				;;	; 		      sin(phi_x)   *    cos(phi_z)                                                  ;;
				;;	; 		      cos(phi_x)   *    cos(phi_y)                                                  ;;
				;;	;		  	      0 	   * 	    0                                                      ;;
				;;                             |                                                                ;;
				;;	; ymm1[8] =     ymm1[8]  vmulps   ymm3[8]                                                   ;;
				;;	; 		      cos(phi_x)   *    cos(phi_z)                                                  ;;
				;;	; 		      cos(phi_x)   *    sin(phi_z)                                                  ;;
				;;	;		  	      0 	   * 	    1                                                       ;;
				;;	;		  	      0 	   * 	    0                                                       ;;
				;;	; 		      cos(phi_x)   *    cos(phi_z)                                                  ;;
				;;	; 		      cos(phi_x)   *    sin(phi_z)                                                  ;;
				;;	;		  	      0 	   *  	    1                                                       ;;
				;;	;		  	      0 	   * 	    0                                                       ;;
				;;                             |                                                                ;;
				;;	; ymm1[8] =     ymm1[8]  vmulps   ymm4[8]                                                   ;;
				;;	; 		  	    ymm1[0]    *    sin(phi_y)                                                  ;;
				;;	; 		  	    ymm1[1]    *    sin(phi_y)                                                  ;;
				;;	; 		  	    ymm1[2]    * 	    0                                                       ;;
				;;	; 		  	    ymm1[3]    * 	    0                                                       ;;
				;;	; 		  	    ymm1[4]    *    sin(phi_y)                                                  ;;
				;;	; 		  	    ymm1[5]    *    sin(phi_y)                                                  ;;
				;;	; 		  	    ymm1[6]    * 	    0                                                       ;;
				;;	; 		  	    ymm1[7]    * 	    0                                                       ;;
				;;                             |                                                                ;;
				;;	; ymm1[8] =     ymm1[8]  vmulps   ymm5[8]                                                   ;;
				;;	; 		  	    ymm1[0]    *        -1                                                      ;;
				;;	; 		  	    ymm1[1]    *        -1                                                      ;;
				;;	; 		  	    ymm1[2]    *        -1                                                      ;;
				;;	; 		  	    ymm1[3]    *         0                                                      ;;
				;;	; 		  	    ymm1[4]    *        -1                                                      ;;
				;;	; 		  	    ymm1[5]    *        -1                                                      ;;
				;;	; 		  	    ymm1[6]    *        -1                                                      ;;
				;;	; 		  	    ymm1[7]    *         0                                                      ;;
				;;                             |                                                                ;;
				;;	; ymm0[8] =     ymm0[8] vaddsubps ymm1[8]                                                   ;;
				;;	; 		  	    ymm0[0]     - 	  ymm1[0]                                                   ;;
				;;	; 		  	    ymm0[1]     + 	  ymm1[1]                                                   ;;
				;;	; 		  	    ymm0[2]     - 	  ymm1[2]                                                   ;;
				;;	; 		  	    ymm0[3]     + 	  ymm1[3]                                                   ;;
				;;	; 		  	    ymm0[4]     - 	  ymm1[4]                                                   ;;
				;;	; 		  	    ymm0[5]     + 	  ymm1[5]                                                   ;;
				;;	; 		  	    ymm0[6]     - 	  ymm1[6]                                                   ;;
				;;	; 		  	    ymm0[7]     + 	  ymm1[7]                                                   ;;
				;;                             |                                                                ;;
				;;	; ymm0[8] =     ymm0[8]  vmulps   ymm7[8]                                                   ;;
				;;	; 		 	    ymm0[0]    *        x                                                       ;;
				;;	; 		 	    ymm0[1]    *        y                                                       ;;
				;;	; 		 	    ymm0[2]    *        z                                                       ;;
				;;	; 		 	    ymm0[3]	   *      color                                                     ;;
				;;	; 		 	    ymm0[4]    *        x                                                       ;;
				;;	; 		 	    ymm0[5]    *        y                                                       ;;
				;;	; 		 	    ymm0[6]    *        z                                                       ;;
				;;	; 		 	    ymm0[7]	   *      color                                                     ;;
				;;                             |                                                                ;;
				;;	; ymm0[8] =     ymm0[8]  vhaddps   ymm0[8]    (works inside each 128-bit lane)              ;;
				;;	; 		  	    ymm0[0]     +      ymm0[1]                                                  ;;
				;;	; 		  	    ymm0[2]     +      ymm0[3]                                                  ;;
				;;	; 		  	    ymm0[0]     +      ymm0[1]                                                  ;;
				;;	; 		  	    ymm0[2]     +      ymm0[3]                                                  ;;
				;;	; 		  	    ymm0[4]     +      ymm0[5]                                                  ;;
				;;	; 		  	    ymm0[6]     +      ymm0[7]                                                  ;;
				;;	; 		  	    ymm0[4]     +      ymm0[5]                                                  ;;
				;;	; 		  	    ymm0[6]     +      ymm0[7]                                                  ;;
				;;                              |                                                               ;;
				;;	; ymm0[8] =     ymm0[8]  vhaddps   ymm0[8]    (works inside each 128-bit lane)              ;;
				;;	; 		  	    ymm0[0]     +      ymm0[1]                                                  ;;
				;;	; 		  	    ymm0[2]     +      ymm0[3]                                                  ;;
				;;	; 		  	    ymm0[0]     +      ymm0[1]                                                  ;;
				;;	; 		  	    ymm0[2]     +      ymm0[3]                                                  ;;
				;;	; 		  	    ymm0[4]     +      ymm0[5]                                                  ;;
				;;	; 		  	    ymm0[6]     +      ymm0[7]                                                  ;;
				;;	; 		  	    ymm0[4]     +      ymm0[5]                                                  ;;
				;;	; 		  	    ymm0[6]     +      ymm0[7]                                                  ;;
				;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
				;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
				vmovups		ymm0, [rotate_z_ymm0]
				vmovups		ymm1, [rotate_z_ymm1]
				
				vmulps		ymm0, ymm2			; ymm2 = rotate_yz_ymm2
				vmulps		ymm1, ymm3			; ymm3 = rotate_xyz_ymm3
				vmulps		ymm1, ymm4			; ymm4 = rotate_yz_ymm4
				vmulps		ymm1, ymm5			; ymm5 = rotate_z_ymm5
				
				vaddsubps	ymm0, ymm1			; even elements: ymm0 - ymm1, odd elements: ymm0 + ymm1

				vmulps		ymm0, ymm7			; ymm7 = coordonee
				
				vhaddps		ymm0, ymm0
				vhaddps		ymm0, ymm0

				; Store the new coordinate: z
				movss		[coordonee_after + 16], xmm0
				
					vmovups		[temp], ymm0
					movss		xmm0, [temp + _color + 4]
				movss		[coordonee_after + 20], xmm0
		;=============
		; / roll
		;=============
; ret			
;====================================================================================================
;END_FONCTIONS		END_FONCTIONS		  END_FONCTIONS	    	END_FONCTIONS	    	END_FONCTIONS	
;====================================================================================================
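
A scalar reference is handy for checking the SIMD result of the Roll block. The sketch below is plain C and simply evaluates the z formula written in the comment box above. The `Trig` struct, its field names, and `rotate_z_row` are hypothetical names introduced for illustration; they are not part of the posted engine.

```c
/* Hypothetical container for the six precomputed trig values. */
typedef struct {
    float sin_x, cos_x;
    float sin_y, cos_y;
    float sin_z, cos_z;
} Trig;

/* Scalar reference for the z formula in the Roll comment block:
   z' = x*(sin_x*sin_z + cos_x*cos_z*sin_y)
      + y*(sin_x*cos_z - cos_x*sin_z*sin_y)
      + z*(cos_x*cos_y)                                          */
float rotate_z_row(const Trig *t, float x, float y, float z)
{
    float kx = t->sin_x * t->sin_z + t->cos_x * t->cos_z * t->sin_y;
    float ky = t->sin_x * t->cos_z - t->cos_x * t->sin_z * t->sin_y;
    float kz = t->cos_x * t->cos_y;
    return x * kx + y * ky + z * kz;
}
```

Comparing this value against the float stored at `[coordonee_after + 16]` for a few known angles is a quick way to catch sign or lane-order mistakes in the vector version.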

Bernard · ‎12-13-2014

It seems that a single value is accessed every time. You can see this by looking at the assembly: all those scalar VEX-encoded instructions confirm it.

How is the trigo structure declared? Is it an SoA layout of statically allocated arrays? The base pointer to trigo is loaded into R13, and the float members are accessed by incrementing the pointer by 4. My advice is to use intrinsics in order to produce vector code.

An access like trigo.cos will probably be translated into scalar code. I had a similar issue when I tried to access a __m256d union type through its members, that is, through a double ar[4] array.
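
To illustrate the advice above, here is a minimal sketch of an SoA table read through intrinsics instead of member-by-member access. Everything named here (`TrigSoA`, `TRIG_N`, `sin_t`, `cos_t`, `rotate_x_soa`) is hypothetical and not taken from the posted project. It uses 128-bit SSE intrinsics rather than the 256-bit AVX code discussed in this thread, and it falls back to a scalar loop when SSE is not available.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

#define TRIG_N 8   /* hypothetical table size */

/* Hypothetical SoA layout: one array per field. */
typedef struct {
    float sin_t[TRIG_N];
    float cos_t[TRIG_N];
} TrigSoA;

/* out[i] = x[i] * cos_t[i] - y[i] * sin_t[i] for every entry. */
void rotate_x_soa(const TrigSoA *t, const float *x, const float *y, float *out)
{
    size_t i = 0;
#if defined(__SSE__)
    /* Four floats per iteration, loaded straight from the arrays. */
    for (; i + 4 <= TRIG_N; i += 4) {
        __m128 c  = _mm_loadu_ps(&t->cos_t[i]);
        __m128 s  = _mm_loadu_ps(&t->sin_t[i]);
        __m128 vx = _mm_loadu_ps(&x[i]);
        __m128 vy = _mm_loadu_ps(&y[i]);
        __m128 r  = _mm_sub_ps(_mm_mul_ps(vx, c), _mm_mul_ps(vy, s));
        _mm_storeu_ps(&out[i], r);
    }
#endif
    /* Scalar tail, and the whole loop when SSE is not available. */
    for (; i < TRIG_N; ++i)
        out[i] = x[i] * t->cos_t[i] - y[i] * t->sin_t[i];
}
```

The point of the layout is that consecutive sines (or cosines) sit next to each other in memory, so one load fills a whole vector register.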

Anonymous · ‎12-13-2014

Yes, it's SoA. Intrinsics are a good approach, but they need too much code, so I will test inline asm instead (less code), now that I know I need to use AVX instructions to avoid the latency.

And what do you think of my way of helping vectorization (described in the comments)?

Bernard · ‎12-13-2014

Yes, your asm version is very good. You are right, you can of course use inline assembly. I agree with you that using intrinsics is not easy.

Anonymous · ‎12-15-2014

Do you know what this data corresponds to, and which algorithm it implements? Extracting the algorithm seems difficult ^^:

		;=============================================================================================================
		 ; float sin[4], cos[4] sincosps (float angle_radians[4])
		 ; Computes the sin and cos of the 4 angles contained in angle_radians[4].
		 ; Input : angle_radians[4], loaded from [sincosps_angle_rad]
		 ; Output: sin[4] in xmm0 (stored to [sincosps_sin]) and cos[4] in xmm6 (stored to [sincosps_cos])
		 ; Destroyed: 	xmm0 - xmm1 - xmm2 - xmm3
		 ;				xmm4 - xmm5 - xmm6 - xmm7
		 ; DATA:
				_ps_am_inv_sign_mask	dd	0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF
				_ps_am_sign_mask		dd	-0.0, -0.0, -0.0, -0.0
				_ps_am_2_o_pi			dd	0.63661977, 0.63661977, 0.63661977, 0.63661977		; 2/PI
				_ps_am_1 				dd	1.0, 1.0, 1.0, 1.0
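				_epi32_1				dd	1, 1, 1, 1			; integer 1: combined with the vcvttps2dq result by vpand/vpaddd
				_epi32_2 				dd	2, 2, 2, 2			; integer 2: bit 1 is moved to the sign bit by vpslld 30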
				_ps_sincos_p3			dd	-0.00468175, -0.00468175, -0.00468175, -0.00468175
				_ps_sincos_p2 			dd	0.07969262, 0.07969262, 0.07969262, 0.07969262
				_ps_sincos_p1 			dd	-0.64596409, -0.64596409, -0.64596409, -0.64596409
				_ps_sincos_p0 			dd	1.57079632, 1.57079632, 1.57079632, 1.57079632		; PI/2
				
		;=============================================================================================================
		sincosps:
			vmovups		xmm0, [sincosps_angle_rad]			;;; xmm0 = angle
			vmovaps		xmm7, xmm0							;;; xmm7 = angle
			
			vandps 		xmm0, xmm0, [_ps_am_inv_sign_mask]	;;; xmm0 = abs(angle)
			vandps 		xmm7, xmm7, [_ps_am_sign_mask]		;;; xmm7 = sign bit of angle
			
			vmulps 		xmm0, xmm0, [_ps_am_2_o_pi]			;;; xmm0 = (2*angle)/PI
			vpxor 		xmm3, xmm3, xmm3					;;; xmm3 = 0
			vmovups 	xmm5, [_epi32_1]					;;; xmm5 = 1
			vcvttps2dq	xmm2, xmm0							;;; xmm2 = (int)(0.63661977 * angle)
			vpand 		xmm5, xmm5, xmm2					;;; xmm5 = 1 AND xmm2
			vpcmpeqd 	xmm5, xmm5, xmm3					;;; xmm5 = if(xmm5 == xmm3)
															;;;			  xmm5 = 0xFFFFFFFF;
															;;;		   else
															;;;			  xmm5 = 0;
			vmovups 	xmm3, [_epi32_1]					;;; xmm3 = 1
			vcvtdq2ps 	xmm6, xmm2							;;; xmm6 = (float)xmm2
			vpaddd 		xmm3, xmm3, xmm2					;;; xmm3 = 1 + (int)(0.63661977 * angle)
			vpand 		xmm2, xmm2, [_epi32_2]				;;; xmm2 = xmm2 AND 2
			vpand 		xmm3, xmm3, [_epi32_2]				;;; xmm3 = xmm3 AND 2
			vsubps 		xmm0, xmm0, xmm6					;;; xmm0 = xmm0 - xmm6
			vpslld 		xmm2, xmm2, 30						;;; xmm2 = xmm2 * 1073741824
			vminps 		xmm0, xmm0, [_ps_am_1]				;;; xmm0 = min(xmm0, 1)
			vmovups 	xmm4, [_ps_am_1]					;;; xmm4 = 1
			vsubps 		xmm4, xmm4, xmm0					;;; xmm4 = 1 - xmm0
			vpslld 		xmm3, xmm3, 30						;;; xmm3 = xmm3 * 1073741824
			
			vmovaps 	xmm6, xmm4							;;; xmm6 = xmm4
			vxorps 		xmm2, xmm2, xmm7					;;; xmm2 = xmm2 XOR sign(angle)
			vmovaps 	xmm7, xmm5							;;; xmm7 = xmm5
			vandps 		xmm6, xmm6, xmm7					;;; xmm6 = xmm6 AND xmm7
			vandnps 	xmm7, xmm7, xmm0					;;; xmm7 = (NOT xmm7) AND xmm0
			vandps 		xmm0, xmm0, xmm5					;;; xmm0 = xmm0 AND xmm5
			vandnps 	xmm5, xmm5, xmm4					;;; xmm5 = (NOT xmm5) AND xmm4
			vorps 		xmm6, xmm6, xmm7					;;; xmm6 = xmm6 OR xmm7
			vorps 		xmm0, xmm0, xmm5					;;; xmm0 = xmm0 OR xmm5
			
			vmovaps		xmm1, xmm0							;;; xmm1 = xmm0
			vmovaps		xmm7, xmm6							;;; xmm7 = xmm6
			vmulps 		xmm0, xmm0, xmm0					;;; xmm0 = xmm0 * xmm0
			vmulps 		xmm6, xmm6, xmm6					;;; xmm6 = xmm6 * xmm6
			vorps 		xmm1, xmm1, xmm2					;;; xmm1 = xmm1 OR xmm2
			vorps 		xmm7, xmm7, xmm3					;;; xmm7 = xmm7 OR xmm3
			
			vmovaps 	xmm2, xmm0							;;; xmm2 = xmm0
			vmovaps		xmm3, xmm6							;;; xmm3 = xmm6
			vmulps 		xmm0, xmm0, [_ps_sincos_p3]			;;; xmm0 = xmm0 * -0.00468175
			vmulps 		xmm6, xmm6, [_ps_sincos_p3]			;;; xmm6 = xmm6 * -0.00468175
			vaddps 		xmm0, xmm0, [_ps_sincos_p2]			;;; xmm0 = xmm0 + 0.07969262
			vaddps 		xmm6, xmm6, [_ps_sincos_p2]			;;; xmm6 = xmm6 + 0.07969262
			vmulps 		xmm0, xmm0, xmm2					;;; xmm0 = xmm0 * xmm2
			vmulps 		xmm6, xmm6, xmm3					;;; xmm6 = xmm6 * xmm3
			vaddps 		xmm0, xmm0, [_ps_sincos_p1]			;;; xmm0 = xmm0 + -0.64596409
			vaddps 		xmm6, xmm6, [_ps_sincos_p1]			;;; xmm6 = xmm6 + -0.64596409
			vmulps 		xmm0, xmm0, xmm2					;;; xmm0 = xmm0 * xmm2
			vmulps 		xmm6, xmm6, xmm3					;;; xmm6 = xmm6 * xmm3
			vaddps 		xmm0, xmm0, [_ps_sincos_p0]			;;; xmm0 = xmm0 + PI/2
			vaddps 		xmm6, xmm6, [_ps_sincos_p0]			;;; xmm6 = xmm6 + PI/2
			vmulps 		xmm0, xmm0, xmm1					;;; xmm0 = xmm0 * xmm1
			vmulps 		xmm6, xmm6, xmm7					;;; xmm6 = xmm6 * xmm7
			
			vmovups		[sincosps_sin], xmm0 				;;; store sin(angle)
			vmovups		[sincosps_cos], xmm6 				;;; store cos(angle)
		ret
		;=============================================================================================================
		; / sincosps
		;=============================================================================================================

Bernard · ‎12-15-2014

I think this is a sincos function that operates on float data.

Bernard · ‎12-16-2014

Are you looking for a sincos formula?

Anonymous · ‎12-16-2014

Yes.

(*) I also tried assembling all my asm code with the Intel compiler through __asm{}, and unfortunately I get 44 fps. I say unfortunately because when I use NASM with the same code, I get 17 fps -_-

From what I learned, assembly language cannot be optimized any further, because it is the lowest-level language above machine language. But that theory does not seem to be correct -_- And I'm not enthusiastic about learning machine language :D

As for the C code, I examined its generated assembly again and found something new: icl draws 4 pixels per putpixel(), whereas I draw only 1, or 2 with a ymm register, but that version seems buggy. I don't know exactly how icl does it, but I will look into it later :/

* I have uploaded the source code, which contains all my asm code. I did not keep some basic functions (WinAPI, printf, etc.) in asm form because they don't matter. You only need to look at the printed fps to see the performance of the asm version. I still have a small problem with the cube's display.

Anonymous · ‎12-16-2014

Finally, it was a false alarm: I get the same fps as the Intel compiler (~44) after converting the remaining SSE instructions to AVX.

			vmovss		xmm0, [rsi + _color]
			vmovss		xmm1, [rsi + (_color * 2) + _next]
			; Transfer the color to the pixel addressed by:
			; REPERE - (PITCH * y + x * 4)
				vmovd		[rdi + r8], xmm0
				vmovd		[rdi + r9], xmm1

Bernard · ‎12-17-2014

I cannot find the sincos formula/algorithm. I think that the last part of that code corresponds to some kind of Horner scheme, which is used to evaluate polynomials.
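
Reading the listing instruction by instruction suggests the following structure, sketched here in scalar C. This is an interpretation of the posted code, not its original source, and the names `sin_quarter` and `sincos_approx` are made up for this sketch. The angle is scaled by 2/PI, split into a quadrant number and a fraction, and a degree-7 odd polynomial approximating sin(PI/2 * y) on [0, 1] is evaluated with Horner's scheme. The four constants match the first Taylor terms of that function, (PI/2)^(2k+1) / (2k+1)! with alternating signs.

```c
/* Coefficients as they appear in the posted data block. */
static const float P0 =  1.57079632f;   /*  PI/2           */
static const float P1 = -0.64596409f;   /* -(PI/2)^3 / 3!  */
static const float P2 =  0.07969262f;   /*  (PI/2)^5 / 5!  */
static const float P3 = -0.00468175f;   /* -(PI/2)^7 / 7!  */

/* Approximates sin(PI/2 * y) for y in [0, 1], Horner's scheme in y*y. */
static float sin_quarter(float y)
{
    float y2 = y * y;
    return (((P3 * y2 + P2) * y2 + P1) * y2 + P0) * y;
}

/* Scalar model of one lane of the routine (hypothetical name). */
void sincos_approx(float angle, float *s, float *c)
{
    float a = angle < 0.0f ? -angle : angle;  /* abs(angle)                 */
    float t = a * 0.63661977f;                /* abs(angle) * 2/PI          */
    int   q = (int)t;                         /* quadrant, truncated        */
    float f = t - (float)q;                   /* fraction inside quadrant   */
    if (f > 1.0f) f = 1.0f;                   /* mirrors the vminps clamp   */
    float g = 1.0f - f;

    int even = (q & 1) == 0;                  /* even quadrant: sin uses f  */
    float sv = sin_quarter(even ? f : g);
    float cv = sin_quarter(even ? g : f);

    if (q & 2)        sv = -sv;               /* sin is negative in quadrants 2 and 3 */
    if (angle < 0.0f) sv = -sv;               /* sin is an odd function               */
    if ((q + 1) & 2)  cv = -cv;               /* cos is negative in quadrants 1 and 2 */

    *s = sv;
    *c = cv;
}
```

In the assembly, the `even ? f : g` selections are done with the vandps/vandnps/vorps mask sequence, and the sign flips are done by shifting bit 1 of the quadrant into the float sign bit and OR-ing it onto the polynomial argument.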