Solved: I have re write all asm code - Page 10

Anonymous · ‎09-02-2014

Hello there,

OK, here we go: I have a dream of making a 3D engine written 100% in Intel assembler, running on the CPU only. For now I use only rotation matrices.

It works, of course, but it is slow when I draw a lot of pixels.

Recently I decided to add voxels to my engine, and it becomes slow at >= 8000 voxels (a 20 * 20 * 20 cube). When I saw that NVIDIA displays 32M voxels (fire), I wondered how they can do it!

I have a rough idea of the reason: MMU, paging, segmentation, memory.

Am I right?

Another question: is the FPU slower than SSE at computing floating point, or does it depend on the data being manipulated?

PS: I work without an OS such as Windows or Linux. I run on my own kernel and bootloader, also written in assembly with NASM.

Sorry if I don't write good English, I'm French and use Google Translate ^-^

Bradley_W_Intel · ‎09-02-2014

You clearly are using the processor in a very advanced way. I will do my best to answer your questions:

1) Why is your voxel engine not able to efficiently render as many voxels as you'd like? Voxel engines need to maximize their use of parallelism (both threading and SIMD) and also to store the data efficiently in an octree or some other structure that can handle sparse data. If you are doing all these things and still not getting the performance you expect, it's an optimization problem. Some Intel tools like VTune Performance Analyzer are excellent for performance analysis.

2) Is single data floating point math faster than SIMD (if I understood you)? Typically SIMD will be faster than single data instructions if your data is laid out in a way that supports the SIMD calls. In all cases, the only way for you to know for certain which way is faster is to test it.

3) How can you select between discrete and processor graphics? DirectX has methods of enumerating adapters. In such a case, the processor graphics is listed separately from the discrete graphics. If you are choosing your adapter based on the amount of available memory, you may be favoring the processor graphics when you didn't intend to. Intel has sample code that shows how to properly detect adapters in DirectX at https://software.intel.com/en-us/vcsource/samples/gpu-detect. The process for OpenGL is not well documented.

4) Can I use one processor to control execution of a second processor? Probably not. The details on Intel processors are covered at http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html. It's possible, though unlikely, that you'll be able to find something in there that can help you.
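
As an illustration of point 2's "data laid out in a way that supports the SIMD calls", here is a minimal C sketch of a structure-of-arrays layout. The names VoxelsSoA, N and translate_x are hypothetical and only serve the example; whether such a layout is actually faster for a given engine still has to be measured.

```c
#include <stddef.h>

#define N 8  /* hypothetical voxel count for the example */

/* Structure of arrays: all x values are contiguous, then all y, then all z.
   A loop over x touches consecutive floats, which a vectorizing compiler
   (or hand-written SIMD) can process several at a time. */
typedef struct {
    float x[N];
    float y[N];
    float z[N];
} VoxelsSoA;

/* Add dx to every x coordinate. Each iteration is independent of the others. */
static void translate_x(VoxelsSoA *v, float dx)
{
    for (size_t i = 0; i < N; ++i)
        v->x[i] += dx;
}
```

Calling translate_x(&v, 1.5f) shifts all N x coordinates and leaves y and z untouched. With an array-of-structures layout (x, y, z, color per voxel) the same loop would have to skip over the other fields.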

Bernard · ‎12-18-2014

>>>And I tried to assemble all my asm code with the Intel compiler through __asm{}, and unfortunately I get 44 fps. I say unfortunately because when I use NASM with the same code, I get 17 fps -_->>>

The compiler will not optimize inline assembly code; you should know that. As I understand it, the ICC-compiled code was faster, achieving 44 fps?

>>>So, as I learned it, assembly language cannot be optimized because it is the first low-level language above machine language. But this theory seems incorrect -_- And I'm not enthusiastic about learning machine language>>>

Do you mean inline assembly or assembly in general?

Btw, it can be optimized, but that requires deep knowledge of the target CPU from the programmer. Please refer to Michael Abrash's book on assembly code optimization for 3D graphics.

http://blogs.valvesoftware.com/abrash/

http://www.amazon.com/Michael-Abrashs-Graphics-Programming-Special/dp/1576101746

Anonymous · ‎12-18-2014

ICL was the faster one at 44 fps, but there is a strange thing: ICL achieves those 44 fps while still using the SSE instructions that I changed to AVX yesterday in NASM:

In ICL:

			vmovss		xmm0, [rsi + _color]
			vmovss		xmm1, [rsi + (_color * 2) + _next]
			; Transfer the color to the pixel targeted by the computation:
			; REPERE - (PITCH * y + x * 4)
			movd	[rdi + r8], xmm0
			movd	[rdi + r9], xmm1

In NASM:

			; Transfer the color to the pixel targeted by the computation:
			; REPERE - (PITCH * y + x * 4)
			vmovd		xmm11, [rsi + _color]
			vmovd		[screen + REPERE + r8], xmm11

			vmovd		xmm11, [rsi + (_color * 2) + _next]
			vmovd		[screen + REPERE + r9], xmm11

Logically, we could say that ICL optimizes the asm code a little when translating it to AVX instructions. NASM has optimization flags too, but I don't know how to see what they change other than by analyzing the machine code.

As for the FPS difference between the Intel compiler and NASM: do you think ICL uses a technique other than vectorization, such as parallelization of the code, which I know nothing about? There is cache management too, but I don't know about that either. (CPU technologies are really complicated and complex, but so fascinating ^^)

Anonymous · ‎12-18-2014

Do you know of any web course that teaches Intel CPU technology from the assembly side, or do I still need to learn from the Intel software documentation? :/

Bernard · ‎12-18-2014

>>>And I'm not enthusiastic about learning machine language :D>>>

Assembly language is a direct representation of machine code.

Bernard · ‎12-19-2014

I do not understand why you used integer instructions in the NASM code. You have register pressure in the NASM version, where one register is used to load and store the data. It works because the CPU internally renames the architectural register and allocates a physical register to handle the data. I suppose that the NASM version cannot be executed in parallel because of the dependency on the xmm11 register. Try using an additional register, recompile, and compare both versions of the code.

Bernard · ‎12-19-2014

shaynox s. wrote:

Do you know of any web course that teaches Intel CPU technology from the assembly side, or do I still need to learn from the Intel software documentation? :/

I do not know of any web-based course. You can download and read M. Abrash's book on 3D optimization; it is outdated today, but it has plenty of information. Another option is, of course, the Intel documentation and the IDZ forums. This is how I am learning.

Bernard · ‎12-20-2014

>>>Using another register doesn't change anything.>>>

Rewrite the code exactly like the ICL code.

>>>But I don't understand the parallelization technology. What does it really do?>>>

I suppose that you are asking about ILP = "Instruction Level Parallelism".

http://en.wikipedia.org/wiki/Instruction-level_parallelism
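
A minimal C sketch of what ILP means in practice, assuming a plain array sum as the workload. The function names sum_serial and sum_ilp are hypothetical. The second version only gives an out-of-order core the opportunity to overlap work; any actual speedup has to be measured, and an optimizing compiler may already transform the first version on its own.

```c
#include <stddef.h>

/* One accumulator: every addition needs the result of the previous one,
   so the additions form a single serial dependency chain. */
static long sum_serial(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

/* Four accumulators: the four chains do not depend on each other,
   so the CPU is free to execute their additions in parallel. */
static long sum_ilp(const int *a, size_t n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; ++i)  /* leftover elements when n is not a multiple of 4 */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```

Both functions return the same sum; only the shape of the dependency chains differs.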

Anonymous · ‎12-21-2014

Do you want me to rewrite all the asm code generated by ICL in a NASM version?

Is it normal that I can't find anything about parallelization techniques in the Intel® 64 and IA-32 Architectures Software Developer’s Manuals?

Bernard · ‎12-23-2014

>>>Do you want me to rewrite all the asm code generated by ICL in a NASM version?>>>

Why not? Sometimes I do the same.

>>>Is it normal that I can't find anything about parallelization techniques in the Intel® 64 and IA-32 Architectures Software Developer’s Manuals?>>>

You should look at the Software Optimization Manual.

Anonymous · ‎12-27-2014

Is it right that the CPU manages its cores automatically?

Or can we manage them manually to get/program real multitasking (for # cores/programs ^^)?

Bernard · ‎12-27-2014

shaynox s. wrote:

Is it right that the CPU manages its cores automatically?

Or can we manage them manually to get/program real multitasking (for # cores/programs ^^)?

The CPU manages its cores at a very low level. For example, power management can be programmed by OS software, but the exact algorithm is inaccessible to software.

Regarding multithreading, the programmer, with the help of the OS, is in charge, not the CPU.

Bernard · ‎12-27-2014

For multithreaded programming, consider OpenMP.

http://openmp.org/wp/
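
A minimal sketch of an OpenMP loop in C, assuming a hosted environment (an OS that provides threads and a compiler with its OpenMP switch enabled), so it does not apply as-is to a custom kernel. The function name scale is hypothetical.

```c
#include <stddef.h>

/* Multiply every element by k. The iterations are independent, so OpenMP
   may split the index range across threads. When the code is compiled
   without OpenMP support, the pragma is ignored and the loop runs serially. */
static void scale(float *data, size_t n, float k)
{
    #pragma omp parallel for
    for (long i = 0; i < (long)n; ++i)
        data[i] *= k;
}
```

The result is the same with or without threads; only the wall-clock time may differ, and for small arrays the thread start-up cost can outweigh the gain.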

Anonymous · ‎01-12-2015

I have rewritten all the asm code produced by ICL in a NASM version, and it has the same fps as my own asm code, so I suspect the /Qipo option produces the right code. Moreover, /Qipo is necessary for the program to run without crashing at startup.

Maybe Intel doesn't want to let users see some optimization secrets :)

Bernard · ‎01-14-2015

I think that /Qipo could have helped to optimize out your code. Please refer to this document https://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-F72F0700-46DA-4FB7-9B73-6ADC12F9D086.htm

Btw, did you replace your hand-written NASM assembly with the code generated by the Intel Compiler?

Anonymous · ‎01-14-2015

Yes, but I forgot to mention that the program won't run: when I remove the /Qipo option, even if we reproduce all of ICL's asm code, it crashes, so never mind.

(I think it is just a stack declaration problem.)

As for the sincos function, I decided to go back to the FPU, and I don't have a latency problem:

		%define		_1ps_1				0
		%define		_1ps_2				4
		%define		_1ps_3				8
		%define		_1ps_4				12
		%define		_2ps_1				16
		%define		_2ps_2				20
		%define		_2ps_3				24
		%define		_2ps_4				28	
		;=============================================================================================================
		 ; fsincosps(float fsincosps_angle[4])
		 ; Computes the sin and cos of the 4 angles contained in fsincosps_angle[4].
		 ; Input: fsincosps_angle[4]
		 ; Output: fsincosps_cos[4], fsincosps_sin[4]
		 ; Destroyed:
		 ; DATA:
		 		pi_180		dd	0.01745329
							dd	0.01745329
							dd	0.01745329
							dd	0	
		;=============================================================================================================
		fsincosps:
			; Convert the angles from degrees to radians
				vmovdqu		xmm0, [fsincosps_angle]
				vmulps		xmm0, [pi_180]
				vmovdqu		[fsincosps_angle], xmm0
				
			fld		dword [fsincosps_angle + _1ps_1]
			fsincos
				fstp 	dword [fsincosps_cos + _1ps_1]
				fstp	dword [fsincosps_sin + _1ps_1]

			fld		dword [fsincosps_angle + _1ps_2]
			fsincos
				fstp 	dword [fsincosps_cos + _1ps_2]
				fstp 	dword [fsincosps_sin + _1ps_2]

			fld		dword [fsincosps_angle + _1ps_3]
			fsincos
				fstp 	dword [fsincosps_cos + _1ps_3]
				fstp 	dword [fsincosps_sin + _1ps_3]

			fld		dword [fsincosps_angle + _1ps_4]
			fsincos
				fstp 	dword [fsincosps_cos + _1ps_4]
				fstp 	dword [fsincosps_sin + _1ps_4]
		ret
		;=============================================================================================================
		; / fsincosps
		;=============================================================================================================

PS: I have uploaded an update of my program (NASM).

Anonymous · ‎01-15-2015

I'm back. If you are interested, I have written a useful macro to use printf as easily as in C:

http://forum.nasm.us/index.php?topic=2036.0

Sorry, I'm a little too lazy to rewrite all the code here :p.

Anonymous · ‎01-16-2015

3-Volume 2B Instruction Set Reference, N-Z (chap OUT)

Is it normal that this Intel doc says the instruction is valid in 64-bit mode?

As I learned it, it's impossible to do this under 64-bit Windows or in my own OS, so who is telling the truth?

And I think the drivers (DLLs) used by Windows (gdi32.dll, ...) use PMIO or MMIO to talk to the hardware (video card).

Is it possible to read memory left to right:

array:		dd	1			; 0
			dd  2			; 4
			dd  3			; 8
			dd  4			; 12
			dd  5			; 16
			dd  6			; 20
			dd  7			; 24
			dd  8			; 28
vmovups      ymm0, [array]

to give:
ymm0[0] = 8;
ymm0[1] = 7;
ymm0[2] = 6;
ymm0[3] = 5;
ymm0[4] = 4;
ymm0[5] = 3;
ymm0[6] = 2;
ymm0[7] = 1;

instead of right to left:

ymm0[0] = 1;
ymm0[1] = 2;
ymm0[2] = 3;
ymm0[3] = 4;
ymm0[4] = 5;
ymm0[5] = 6;
ymm0[6] = 7;
ymm0[7] = 8;
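
A note on the lane-order question above: vmovups places the dword at the lowest address in element 0, so getting the reversed order takes a permute after the load. The C sketch below only models the element selection of the AVX2 instruction VPERMPS (dst[i] = src[idx[i] & 7]); it is not NASM code, and permps256 and reverse8 are hypothetical helper names. On AVX-only hardware the same result can be built from VPERMILPS with immediate 0x1B followed by VPERM2F128 with immediate 0x01.

```c
/* Model of VPERMPS on a 256-bit register: each destination element i
   takes the source element selected by the low 3 bits of idx[i]. */
static void permps256(float dst[8], const unsigned idx[8], const float src[8])
{
    for (int i = 0; i < 8; ++i)
        dst[i] = src[idx[i] & 7u];
}

/* Reverse the 8 elements by using the index vector 7, 6, ..., 0. */
static void reverse8(float dst[8], const float src[8])
{
    static const unsigned rev[8] = {7, 6, 5, 4, 3, 2, 1, 0};
    permps256(dst, rev, src);
}
```

In assembly the index vector would be loaded once into a register and reused for every load that needs reversing.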

Bernard · ‎01-17-2015

>>>even if we reproduce all asm code of icl, it crash, so nvm.>>>

Does it crash on Windows? Can you upload the dump file of the failed process?

Bernard · ‎01-17-2015

shaynox s. wrote:

I'm back. If you are interested, I have written a useful macro to use printf as easily as in C:

http://forum.nasm.us/index.php?topic=2036.0

Sorry, I'm a little too lazy to rewrite all the code here :p.

Nice stuff, thanks for providing the link.

Btw, I will upload my simple program (still at alpha stage) for particle modelling. I tried to use vectorization where it applies.

Bernard · ‎01-17-2015

>>>And i think of driver (dll) using in windows (gdi32.dll, ..), they use PMIO or MMIO for dialog with hardware (video card).>>>

AFAIK gdi32.dll calls into the win32k.sys windowing driver, which in turn communicates with the display driver.

Anonymous · ‎01-17-2015

So if I call functions in win32k.sys directly, will it be faster than gdi32.dll's functions?

(http://msdn.microsoft.com/en-us/library/windows/hardware/ff564185%28v=vs.85%29.aspx)

And I learned that a .sys file is like a DLL, because I opened win32k.sys in http://www.nirsoft.net/utils/dll_export_viewer.html after changing its extension: win32k.dll.

As for uploading the dump file of the failed process, I can't do that now (reinstalling VS2013).

And I have good news: I have found a new way to do rotation (thanks to http://abreojosensamblador.net/Productos/AOWG/html/Pags_en/Chap04.html, sub-chapter: 4.1.7.2.4. Relative to the three axes)

This way, I do only 6 calculations instead of 20 with the previous method:

	; Array indices (byte offsets)
		%define		_1x					0
		%define		_1y					4
		%define		_1z					8
		%define		_1color				12
		%define		_2x					16
		%define		_2y					20
		%define		_2z					24
		%define		_2color				28
		%define		_3x					32
		%define		_3y					36
		%define		_3z					40
		%define		_3color				44
		%define		_4x					48
		%define		_4y					52
		%define		_4z					56
		%define		_4color				60

			; Duplicate cos(x)
				vbroadcastss		ymm0, [fsincosps_cos + _x]
			; Duplicate cos(y)
				vbroadcastss		ymm1, [fsincosps_cos + _y]
			; Duplicate cos(z)
				vbroadcastss		ymm2, [fsincosps_cos + _z]

			; Duplicate sin(x)
				vbroadcastss		ymm3, [fsincosps_sin + _x]
			; Duplicate sin(y)
				vbroadcastss		ymm4, [fsincosps_sin + _y]
			; Duplicate sin(z)
				vbroadcastss		ymm5, [fsincosps_sin + _z]

make_rotate:
		;--------------------------------------------------------.
		; 						X-axis                          ;|
		; 		 y' =  (y * cos(phi_x)) -  (z * sin(phi_x))     ;| 1 0
		; 		 z' =  (z * cos(phi_x)) +  (y * sin(phi_x))     ;| 2 4
		; 		2y' = (2y * cos(phi_x)) - (2z * sin(phi_x))     ;| 3 8
		;		2z' = (2z * cos(phi_x)) + (2y * sin(phi_x))     ;| 4 12
		; 		3y' = (3y * cos(phi_x)) - (3z * sin(phi_x))     ;| 5 16
		; 		3z' = (3z * cos(phi_x)) + (3y * sin(phi_x))     ;| 6 20
		; 		4y' = (4y * cos(phi_x)) - (4z * sin(phi_x))     ;| 7 24
		;		4z' = (4z * cos(phi_x)) + (4y * sin(phi_x))     ;| 8 28
		;                                                       ;|
		; 		 y =  y'     3y = 3y'                           ;|
		;		 z =  z'     3z = 3z'                           ;|
		;		2y = 2y'     4y = 4y'                           ;|
		;		2z = 2z'     4z = 4z'                           ;|
		;-------------------------------------------------------;|
		;                                                       ;|
		;-------------------------------------------------------;|
		; 						Y-axis                          ;|
		; End. z  =  z' =  (z * cos(phi_y)) -  (x * sin(phi_y)) ;| 1 0
		;			 x' =  (x * cos(phi_y)) +  (z * sin(phi_y)) ;| 2 4
		; End.2z  = 2z' = (2z * cos(phi_y)) - (2x * sin(phi_y)) ;| 3 8
		;			2x' = (2x * cos(phi_y)) + (2z * sin(phi_y)) ;| 4 12
		; End.3z  = 3z' = (3z * cos(phi_y)) - (3x * sin(phi_y)) ;| 5 16
		;			3x' = (3x * cos(phi_y)) + (3z * sin(phi_y)) ;| 6 20
		; End.4z  = 4z' = (4z * cos(phi_y)) - (4x * sin(phi_y)) ;| 7 24
		;			4x' = (4x * cos(phi_y)) + (4z * sin(phi_y)) ;| 8 28
		;                                                       ;|
		;			 x =  x'                                    ;|
		;			2x = 2x'                                    ;|
		;			3x = 3x'                                    ;|
		;			4x = 4x'                                    ;|
		;-------------------------------------------------------;|
		;                                                       ;|
		;-------------------------------------------------------;|
		; 						Z-axis                          ;|
		; End. x  =  x' =  (x * cos(phi_z)) -  (y * sin(phi_z)) ;| 1 0
		; End. y  =  y' =  (y * cos(phi_z)) +  (x * sin(phi_z)) ;| 2 4
		; End.2x  = 2x' = (2x * cos(phi_z)) - (2y * sin(phi_z)) ;| 3 8
		; End.2y  = 2y' = (2y * cos(phi_z)) + (2x * sin(phi_z)) ;| 4 12
		; End.3x  = 3x' = (3x * cos(phi_z)) - (3y * sin(phi_z)) ;| 5 16
		; End.3y  = 3y' = (3y * cos(phi_z)) + (3x * sin(phi_z)) ;| 6 20
		; End.4x  = 4x' = (4x * cos(phi_z)) - (4y * sin(phi_z)) ;| 7 24
		; End.4y  = 4y' = (4y * cos(phi_z)) + (4x * sin(phi_z)) ;| 8 28
		;--------------------------------------------------------.

		; Store i = 0::Loop(i <= 4 , j <= 8){ ymm = iY; j++; ymm = iZ; i++; j++;}
			vmovlps		xmm6, [rotate_rsi + _1y]
			vmovlps		xmm7, [rotate_rsi + _2y]
			vmovlps		xmm8, [rotate_rsi + _3y]
			vmovlps		xmm9, [rotate_rsi + _4y]
			vmovlps		[rotate_x_yz +  0], xmm6
			vmovlps		[rotate_x_yz +  8], xmm7
			vmovlps		[rotate_x_yz + 16], xmm8
			vmovlps		[rotate_x_yz + 24], xmm9
			
		; Store i = 0::Loop(i <= 4 , j <= 8){ ymm = iZ; j++; ymm = iY; i++; j++;}
			vextractps	dword [rotate_x_zy +  0], xmm6, 1		; _1z
			vmovss		dword [rotate_x_zy +  4], xmm6		; _1y
			vextractps	dword [rotate_x_zy +  8], xmm7, 1		; _2z
			vmovss		dword [rotate_x_zy + 12], xmm7		; _2y
			vextractps	dword [rotate_x_zy + 16], xmm8, 1		; _3z
			vmovss		dword [rotate_x_zy + 20], xmm8		; _3y
			vextractps	dword [rotate_x_zy + 24], xmm9, 1		; _4z
			vmovss		dword [rotate_x_zy + 28], xmm9		; _4y

		; X-axis			ymm6			   ymm7
			; y' = (y * cos(phi_x)) - (z * sin(phi_x))
			; z' = (z * cos(phi_x)) + (y * sin(phi_x))
					vmulps		ymm6, ymm0, [rotate_x_yz]		; ymm6 * cos(x)
					vmulps		ymm7, ymm3, [rotate_x_zy]		; ymm7 * sin(x)
				vaddsubps		ymm6, ymm7

			vmovups		[moveobject_tmp], ymm6
			; y = y'  ymm6
			; z = z'  ymm6
			vmovss		xmm7,  [rotate_rsi + _1x]
			vmovss		xmm8,  [rotate_rsi + _2x]
			vmovss		xmm9,  [rotate_rsi + _3x]
			vmovss		xmm10, [rotate_rsi + _4x]
			; Store i = 0::Loop(i <= 4 , j <= 8){ ymm = iZ; j++; ymm = iX; i++; j++;}
				vmovups		[rotate_y_zx -  4], ymm6
				vmovss		[rotate_y_zx +  4], xmm7
				vmovss		[rotate_y_zx + 12], xmm8
				vmovss		[rotate_y_zx + 20], xmm9
				vmovss		[rotate_y_zx + 28], xmm10

			; Store i = 0::Loop(i <= 4 , j <= 8){ ymm = iX; j++; ymm = iZ; i++; j++;}
				vmovups		[rotate_y_xz +  0], ymm6
				vmovss		[rotate_y_xz +  0], xmm7
				vmovss		[rotate_y_xz +  8], xmm8
				vmovss		[rotate_y_xz + 16], xmm9
				vmovss		[rotate_y_xz + 24], xmm10

		; Y-axis			ymm6			   ymm7
			; z' = (z * cos(phi_y)) - (x * sin(phi_y))
			; x' = (x * cos(phi_y)) + (z * sin(phi_y))
					vmulps		ymm6, ymm1, [rotate_y_zx]		; ymm6 * cos(y)
					vmulps		ymm7, ymm4, [rotate_y_xz]		; ymm7 * sin(y)
				vaddsubps		ymm6, ymm7

			; x = x'  xmm3
			vmovss		xmm7 , [moveobject_tmp]
			vmovss		xmm8 , [moveobject_tmp +  8]
			vmovss		xmm9 , [moveobject_tmp + 16]
			vmovss		xmm10, [moveobject_tmp + 24]
			; Store i = 0::Loop(i <= 4 , j <= 8){ ymm = iX; j++; ymm = iY; i++; j++;}
				vmovups		[rotate_z_xy -  4], ymm6
				vmovss		[rotate_z_xy +  4], xmm7
				vmovss		[rotate_z_xy + 12], xmm8
				vmovss		[rotate_z_xy + 20], xmm9
				vmovss		[rotate_z_xy + 28], xmm10

			; Store i = 0::Loop(i <= 4 , j <= 8){ ymm = iY; j++; ymm = iX; i++; j++;}
				vmovups		[rotate_z_yx +  0], ymm6
				vmovss		[rotate_z_yx +  0], xmm7
				vmovss		[rotate_z_yx +  8], xmm8
				vmovss		[rotate_z_yx + 16], xmm9
				vmovss		[rotate_z_yx + 24], xmm10

			; Save z
				vmovups		[moveobject_tmp], ymm6
				vmovss		xmm6, [moveobject_tmp +  0]
				vmovss		xmm7, [moveobject_tmp +  8]
				vmovss		xmm8, [moveobject_tmp + 16]
				vmovss		xmm9, [moveobject_tmp + 24]
				vmovss		[rbx + _1z], xmm6
				vmovss		[rbx + _2z], xmm7
				vmovss		[rbx + _3z], xmm8
				vmovss		[rbx + _4z], xmm9

		; Z-axis			ymm6			   ymm7
			; x' = (x * cos(phi_z)) - (y * sin(phi_z))
			; y' = (y * cos(phi_z)) + (x * sin(phi_z))
					vmulps		ymm6, ymm2, [rotate_z_xy]		; ymm6 * cos(z)
					vmulps		ymm7, ymm5, [rotate_z_yx]		; ymm7 * sin(z)
				vaddsubps		ymm6, ymm7

		; Save x y
			vmovups		[moveobject_tmp], ymm6
			vmovlps		xmm6, [moveobject_tmp +  0]
			vmovlps		xmm7, [moveobject_tmp +  8]
			vmovlps		xmm8, [moveobject_tmp + 16]
			vmovlps		xmm9, [moveobject_tmp + 24]
			vmovlps		[rbx + _1x], xmm6
			vmovlps		[rbx + _2x], xmm7
			vmovlps		[rbx + _3x], xmm8
			vmovlps		[rbx + _4x], xmm9

	rotate_rsi:			dd	0		; - p1
						dd	0
						dd	0
						dd	0
						dd	0		; - p2
						dd	0
						dd	0
						dd	0
						dd	0		; - p3
						dd	0
						dd	0
						dd	0
						dd	0		; - p4
						dd	0
						dd	0
						dd	0

	moveobject_tmp:		dd  0		; - p1
						dd  0
						dd  0
						dd  0
						dd  0		; - p2
						dd  0
						dd  0
						dd  0
	; X-axis:
		rotate_x_yz:	dd	0			;  y  0
						dd  0			;  Z  4
						dd  0			; 2y  8
						dd  0			; 2z 12
						dd  0			; 3y 16
						dd  0			; 3z 20
						dd  0			; 4y 24
						dd  0			; 4z 28

		rotate_x_zy:	dd  0			;  z  0
						dd  0			;  y  4
						dd  0			; 2z  8
						dd  0			; 2y 12
						dd  0			; 3z 16
						dd  0			; 3y 20
						dd  0			; 4z 24
						dd  0			; 4y 28
	; Y-axis:
						dd	0
		rotate_y_zx:	dd	0			;  z  0
						dd  0			;  x  4
						dd  0			; 2z  8
						dd  0			; 2x 12
						dd  0			; 3z 16
						dd  0			; 3x 20
						dd  0			; 4z 24
						dd  0			; 4x 28
							
		rotate_y_xz:	dd  0			;  x  0
						dd  0			;  z  4
						dd  0			; 2x  8
						dd  0			; 2z 12
						dd  0			; 3x 16
						dd  0			; 3z 20
						dd  0			; 4x 24
						dd  0			; 4z 28
	; Z-axis:
						dd  0
		rotate_z_xy:	dd  0			;  x  0
						dd  0			;  y  4
						dd  0			; 2x  8
						dd  0			; 2y 12
						dd  0			; 3x 16
						dd  0			; 3y 20
						dd  0			; 4x 24
						dd  0			; 4y 28
							
		rotate_z_yx:	dd  0			;  y  0
						dd  0			;  x  4
						dd  0			; 2y  8
						dd  0			; 2x 12
						dd  0			; 3y 16
						dd  0			; 3x 20
						dd  0			; 4y 24
						dd  0			; 4x 28

But as you can see, I do a lot of vmovss, because I need a special operation (swap: E.0 with E.1, E.2 with E.3, ...) that doesn't exist on the CPU as far as I know :/
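
On the pair swap (E.0 with E.1, E.2 with E.3, ...): AVX does have it. VPERMILPS with the immediate 0xB1, or VSHUFPS with the same register as both sources and the same immediate, swaps neighbouring floats inside each 128-bit lane. The C sketch below is only a model of those semantics together with VADDSUBPS, to show how one axis rotation could be done without the scalar stores; permilps256, addsubps256 and rotate_pairs are hypothetical names, and the sketch has not been timed against the listing above.

```c
/* Model of VPERMILPS ymm, ymm, imm8: inside each 128-bit lane (4 floats),
   dst[i] = src[(imm8 >> (2*i)) & 3]. dst and src must not overlap. */
static void permilps256(float dst[8], const float src[8], unsigned imm8)
{
    for (int lane = 0; lane < 8; lane += 4)
        for (int i = 0; i < 4; ++i)
            dst[lane + i] = src[lane + ((imm8 >> (2 * i)) & 3u)];
}

/* Model of VADDSUBPS: even elements are subtracted, odd elements are added. */
static void addsubps256(float dst[8], const float a[8], const float b[8])
{
    for (int i = 0; i < 8; ++i)
        dst[i] = (i % 2 == 0) ? a[i] - b[i] : a[i] + b[i];
}

/* Rotate four (y, z) pairs about the X axis, given c = cos(phi_x) and
   s = sin(phi_x): y' = y*c - z*s and z' = z*c + y*s. */
static void rotate_pairs(float out[8], const float yz[8], float c, float s)
{
    float zy[8], a[8], b[8];
    permilps256(zy, yz, 0xB1);        /* [z1, y1, z2, y2, ...] */
    for (int i = 0; i < 8; ++i) {
        a[i] = yz[i] * c;             /* models vmulps with cos */
        b[i] = zy[i] * s;             /* models vmulps with sin */
    }
    addsubps256(out, a, b);
}
```

With c = 0 and s = 1 (a 90 degree turn), the pairs (1, 2), (3, 4), ... come out as (-2, 1), (-4, 3), ..., which matches y' = -z and z' = y.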

And, surprise, I compute 4 pixels per call of this function.

PS: thanks for your programs.