Solved: Back, is there an instruction - Page 2

Anonymous · ‎09-02-2014

Hello there,

OK, here we go: I have a dream of writing a 3D engine in 100% Intel assembly, running on the CPU only. For now I only use rotation matrices.

It works, of course, but it is slow when I draw a lot of pixels.

Recently I decided to add voxels to my engine, and it is slow when I draw >= 8000 voxels (a 20 * 20 * 20 cube). When I saw that NVIDIA displays 32M voxels (fire), I wondered how they can do it!

I have a rough idea of the reason: MMU, paging, segmentation, memory.

Am I right?

Another question: is the FPU slower than SSE at floating-point computation, or does it depend on the data being manipulated?

PS: I work without an OS like Windows or Linux; I run on my own kernel + bootloader, also written in assembly with NASM.

Sorry if my English is not good, I'm French and I use Google Translate ^-^

Bradley_W_Intel · ‎09-02-2014

You clearly are using the processor in a very advanced way. I will do my best to answer your questions:

1) Why is your voxel engine not able to efficiently render as many voxels as you'd like? Voxel engines need to maximize their use of parallelism (both threading and SIMD) and also to store the data efficiently in an octree or some other structure that can handle sparse data. If you are doing all these things and still not getting the performance you expect, it's an optimization problem. Some Intel tools like VTune Performance Analyzer are excellent for performance analysis.

2) Is single data floating point math faster than SIMD (if I understood you)? Typically SIMD will be faster than single data instructions if your data is laid out in a way that supports the SIMD calls. In all cases, the only way for you to know for certain which way is faster is to test it.

3) How can you select between discrete and processor graphics? DirectX has methods of enumerating adapters. In such a case, the processor graphics is listed separately from the discrete graphics. If you are choosing your adapter based on the amount of available memory, you may be favoring the processor graphics when you didn't intend to. Intel has sample code that shows how to properly detect adapters in DirectX at https://software.intel.com/en-us/vcsource/samples/gpu-detect. The process for OpenGL is not well documented.

4) Can I use one processor to control execution of a second processor? Probably not. The details on Intel processors are covered at http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html. It's possible, though unlikely, that you'll be able to find something in there that can help you.


Anonymous · ‎09-11-2014

For the other VESA/Core functions, read VBE Core 3.0 [Sep. 1998].pdf; use Google to find it.

Bernard · ‎09-12-2014

>>>Typically SIMD will be faster than single data instructions if your data is laid out in a way that supports the SIMD calls. In all cases, the only way for you to know for certain which way is faster is to test it.>>>

Usually for 3D rendering, a very effective data layout is SoA ("Structure of Arrays"), where each 1D array holds one component, such as the pixel colour or the alpha value. This approach maximizes spatial locality, thanks to the linearity of the arrays.
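
To make the SoA idea concrete, here is a minimal sketch in C. The names (`pixel_aos`, `pixels_soa`, `PIXEL_COUNT`) and the "halve the alpha" operation are assumptions chosen for illustration, not something from the engine discussed in this thread.

```c
#include <stddef.h>
#include <stdint.h>

#define PIXEL_COUNT 8  /* hypothetical size, kept small for illustration */

/* Array of Structures: the four channels of one pixel sit together. */
struct pixel_aos {
    uint8_t r, g, b, a;
};

/* Structure of Arrays: each channel is its own contiguous 1D array. */
struct pixels_soa {
    uint8_t r[PIXEL_COUNT];
    uint8_t g[PIXEL_COUNT];
    uint8_t b[PIXEL_COUNT];
    uint8_t a[PIXEL_COUNT];
};

/* Halve every alpha value. Only the alpha array is touched, so the loop
   walks memory linearly and is easy for a compiler to vectorize. */
void halve_alpha_soa(struct pixels_soa *p)
{
    for (size_t i = 0; i < PIXEL_COUNT; i++)
        p->a[i] = (uint8_t)(p->a[i] / 2);
}

/* Same operation on the AoS layout: the loop has to step over r, g and b. */
void halve_alpha_aos(struct pixel_aos *p, size_t count)
{
    for (size_t i = 0; i < count; i++)
        p[i].a = (uint8_t)(p[i].a / 2);
}
```

With the SoA layout, sixteen consecutive alpha bytes fit in one XMM register with a single load, whereas the AoS layout needs shuffles to gather them.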

Bernard · 09-12-2014

>>>Because I would like to run a game without an OS, or on mine (in development), and without DirectX or OpenGL. It sounds like a challenge, and I like to be free (I love assembler), until I see the paging mechanism.>>>

Kudos to you, that is quite a big achievement.

I still think that for an advanced 3D engine design, such as cloth simulation or solving the rendering equation, C or C++ would be easier to use.

Bernard · 09-12-2014

>>>Strange, and yes, even if movdqu is faster, it's still slow: I mean I can see this function executing on screen.>>>

Did you try the vmovapd instruction for faster operation? Before using it, align the data on 32-byte boundaries with an align 32 directive.
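
vmovapd is an AVX instruction, so it is worth checking that the CPU supports AVX before relying on it. The sketch below assumes GCC or Clang on x86 (the `<cpuid.h>` helper header); the function names and `aligned_row` are hypothetical. Note that under your own kernel, the kernel itself would have to set CR4.OSXSAVE and XCR0 before AVX can be used. As far as I know, the Atom N450 mentioned in this thread only goes up to SSSE3.

```c
#include <cpuid.h>   /* GCC/Clang wrapper around the CPUID instruction */
#include <stdint.h>

/* A 32-byte-aligned row of eight floats, suitable for aligned AVX moves. */
static _Alignas(32) float aligned_row[8];

/* Returns 1 if ptr is aligned on a 32-byte boundary, 0 otherwise. */
int is_aligned_32(const void *ptr)
{
    return ((uintptr_t)ptr % 32u) == 0;
}

/* Returns 1 if both the CPU and the OS (or your own kernel) enable AVX. */
int cpu_has_avx(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;

    /* CPUID leaf 1, ECX: bit 27 = OSXSAVE, bit 28 = AVX. */
    if (!(ecx & (1u << 27)) || !(ecx & (1u << 28)))
        return 0;

    /* XCR0 bits 1 and 2 must both be set: SSE and AVX state are enabled. */
    unsigned int xcr0_lo, xcr0_hi;
    __asm__ volatile ("xgetbv" : "=a"(xcr0_lo), "=d"(xcr0_hi) : "c"(0));
    (void)xcr0_hi;
    return (xcr0_lo & 0x6u) == 0x6u;
}
```

An aligned AVX move faults on an address that is not a multiple of 32, which is why the alignment check matters as much as the feature check.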

Bernard · 09-12-2014

>>>Finally, for optimization, I forgot to tell you that I use an eMachines 350 with an Intel® Atom™ Processor N450 (512K Cache, 1.66 GHz) for testing :D>>>

Do you have access to a more powerful CPU, like a Core i7 Haswell?

Anonymous · ‎09-12-2014

Back,

Well, I have found the problem: it wasn't the CPU's fault, but the CRTC information block passed through the VESA mechanism :D

I test my engine at 800 * 600 resolution; this data block is passed to the Set VBE Mode function:

			; Set VBE Mode
				mov		ax, 0x4F02
				mov		bx, 0x4915			; Desired Mode to set, 0x115 = 800 * 600
											; D0-D8		=  		Mode number 
											; D9-D10 	= 0		Reserved (must be 0)
											; D11 		= 0 	Use current default refresh rate
											; 			= 1  	Use user specified CRTC values for refresh rate
											; D12-13 	= 0 	Reserved for VBE/AF (must be 0)
											; D14 		= 0 	Use windowed frame buffer model
											; 			= 1		Use linear/flat frame buffer model
											; D15  		= 0 	Clear display memory
											; 			= 1		Don't clear display memory
				mov		di, CRTCInfoBlock
				int		0x10

	CRTCInfoBlock:
		HorizontalTotal 			dw 	800				; Horizontal total in pixels
		HorizontalSyncStart 		dw 	0				; Horizontal sync start in pixels
		HorizontalSyncEnd 			dw	0				; Horizontal sync end in pixels
		VerticalTotal 				dw 	600 			; Vertical total in lines
		VerticalSyncStart			dw 	0				; Vertical sync start in lines
		VerticalSyncEnd 			dw 	0				; Vertical sync end in lines
		Flags 						db 						0_0_0_0_0_0_0_0b	; Flags (Interlaced, Double Scan etc)
														;	| | | | | | | `----- Double Scan ModeEnable:	0 =  Graphics mode is not double scanned		1 =  Graphics mode is double scanned
														;	| | | | | | `------ Interlaced Mode Enable:		0 =  Graphics mode is non-interlaced 			1 =  Graphics mode is interlaced
														;	| | | | | `------- Horizontal sync polarity:	0 =  Horizontal sync polarity is positive (+) 	1 =  Horizontal sync polarity is negative (-)
														;	| | | | `-------- Vertical sync polarity: 		0 =  Vertical sync polarity is positive (+)		1 =  Vertical sync polarity is negative (-)
														;	| | | `--------- Ignored
														;	| | `---------- Ignored 
														;	| `----------- Ignored
														;	`------------ Ignored
		PixelClock 					dd 	28_800_000	; Pixel clock in units of Hz
		RefreshRate					dw 	6000		; Refresh rate in units of 0.01 Hz
		times	40					db	0 			; reserved: remainder of CRTCInfoBlock

 refreshRate = PixelClock / (HorizontalTotal * VerticalTotal)
 => 60 Hz (stored as 6000, in units of 0.01 Hz) = 28_800_000 / (800 * 600)

(VBE Core 3.0 [Sep. 1998].pdf)
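
The same arithmetic can be sketched in C. The function name `vbe_refresh_rate` is hypothetical. One assumption to flag: as far as I understand the VBE 3.0 document, HorizontalTotal and VerticalTotal count the blanking intervals as well as the visible pixels, so the standard VESA timing for 800 * 600 at 60 Hz (40 MHz pixel clock, totals of 1056 and 628) is shown alongside the values from the post.

```c
#include <assert.h>
#include <stdint.h>

/* Refresh rate in units of 0.01 Hz, the unit used by the RefreshRate field.
   pixel_clock_hz is in Hz; h_total and v_total are the full totals. */
uint16_t vbe_refresh_rate(uint32_t pixel_clock_hz, uint16_t h_total, uint16_t v_total)
{
    assert(h_total != 0 && v_total != 0);

    /* 64-bit intermediate so that pixel_clock_hz * 100 cannot overflow. */
    uint64_t pixels_per_frame = (uint64_t)h_total * v_total;
    return (uint16_t)(((uint64_t)pixel_clock_hz * 100u) / pixels_per_frame);
}
```

`vbe_refresh_rate(28800000, 800, 600)` gives 6000, matching the block above, while `vbe_refresh_rate(40000000, 1056, 628)` gives 6031, i.e. about 60.3 Hz.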

And like magic, it's fluid: I no longer see my screen being cleared. There is still a small problem*, because I don't fully understand all of this information.

I just entered those values using basic knowledge of FPS.

I know this data was designed for the old cathode ray tube, and I don't understand why modern video cards keep this display method.

* = every second I see a slow screen clear, strange

Anonymous · ‎09-12-2014

As for AVX instructions, I will use them later.

Anonymous · ‎09-12-2014

Later, because I test my engine without a machine emulator and I have two PCs, one for coding and one for testing, but the PC with the Intel Core i7 is the coding PC.

And I want to know another thing: the maximum number of vertices that a graphics card can manage.

I know the Z-buffer method, so is it right that the GPU shows only XResolutionScreen * YResolutionScreen per frame, or does it show more? (I don't use a Z-buffer yet, that's for later.)

Thanks

Bernard · 09-12-2014

>>>And I want to know another thing: the maximum number of vertices that a graphics card can manage.>>>

What do you mean by MAX vertices?

Usually a programmable graphics pipeline (GPU) handles vertex shading and lighting.

Anonymous · ‎09-12-2014

I am talking about the pixels/coordinates of a 3D object and its texture.

Finally, about the CRTC information block: it has no effect after all, false alarm :/

Anonymous · ‎09-12-2014

"managed" is not correct word.

How many vertices, at most, can a GPU show in real time (in general) per frame in a video game, minus the pixels removed by the Z-buffer method?

And including textures. I ask because I want to know whether the CPU can replace the GPU for graphics; my goal is to prove it, if it's possible of course.

Bernard · ‎09-13-2014

shaynox s. wrote:

"managed" is not correct word.

How many vertices, at most, can a GPU show in real time (in general) per frame in a video game, minus the pixels removed by the Z-buffer method?

And including textures. I ask because I want to know whether the CPU can replace the GPU for graphics; my goal is to prove it, if it's possible of course.

Take a look at the Kribi 3D software engine: http://www.inartis.com/Products/Kribi%203D%20Engine/Default.aspx

One of the Kribi engine developers was very active on IDZ; his nick is bronxz.

Bernard · 09-13-2014

>>>How many vertices, at most, can a GPU show in real time (in general) per frame in a video game, minus the pixels removed by the Z-buffer method?>>>

I am searching for the relevant data.

Bernard · ‎09-13-2014

If you are into ray-tracing renderers, here is a link to the pbrt project, which includes the full source code of a ray tracer.

http://www.pbrt.org/

Anonymous · ‎09-13-2014

Wow, thanks for your URLs, very interesting projects.

As for the maximum number of vertices, I think that once the GPU gets the scene, it removes all the useless vertices that will not be shown, keeping only the pixels that can be displayed, and then computes physics, lighting, ... for them.

So in summary, the GPU processes at most (XresolutionScreen * YresolutionScreen) vertices per frame, am I right?

Bernard · ‎09-13-2014

shaynox s. wrote:

Wow, thanks for your URLs, very interesting projects.

As for the maximum number of vertices, I think that once the GPU gets the scene, it removes all the useless vertices that will not be shown, keeping only the pixels that can be displayed, and then computes physics, lighting, ... for them.

So in summary, the GPU processes at most (XresolutionScreen * YresolutionScreen) vertices per frame, am I right?

Yes you are right. I think that this is called occlusion culling.
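
For reference, the per-pixel Z-buffer test mentioned earlier in the thread can be sketched in C as below. The buffer names and the tiny resolution are assumptions for illustration. In this sketch the number of pixels finally shown is bounded by width * height, but every fragment that lands on a pixel is still depth-tested first, so the amount of geometry processed can be larger than the screen resolution.

```c
#include <float.h>
#include <stddef.h>
#include <stdint.h>

#define FB_WIDTH  4   /* hypothetical tiny resolution */
#define FB_HEIGHT 2

static float    depth_buffer[FB_WIDTH * FB_HEIGHT];
static uint32_t color_buffer[FB_WIDTH * FB_HEIGHT];

/* Reset every pixel to "infinitely far away" and black. */
void clear_buffers(void)
{
    for (size_t i = 0; i < FB_WIDTH * FB_HEIGHT; i++) {
        depth_buffer[i] = FLT_MAX;
        color_buffer[i] = 0;
    }
}

/* Write a fragment only if it is nearer than what the pixel already holds.
   Returns 1 if the pixel was updated, 0 if the fragment was rejected. */
int plot_fragment(int x, int y, float z, uint32_t color)
{
    if (x < 0 || x >= FB_WIDTH || y < 0 || y >= FB_HEIGHT)
        return 0;                       /* off screen */

    size_t i = (size_t)y * FB_WIDTH + (size_t)x;
    if (z >= depth_buffer[i])
        return 0;                       /* hidden behind a nearer fragment */

    depth_buffer[i] = z;
    color_buffer[i] = color;
    return 1;
}
```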

There is another very good 3D engine, called Wild Magic, developed by Dave Eberly, PhD. Take a look at it. You can also read the author's books on 3D engine development, but beware: they are very math heavy.

http://www.geometrictools.com/

Anonymous · ‎09-13-2014

Thanks for everything, I will look at that.

And do you know which addressing mode (how many bits) video games use on game consoles (Xbox, PS4, ...)?

Anonymous · ‎09-13-2014

I saw an assembler program for an Xbox emulator in 32-bit.

Bernard · ‎09-13-2014

shaynox s. wrote:

Thanks for everything, I will look at that.

And do you know which addressing mode (how many bits) video games use on game consoles (Xbox, PS4, ...)?

The PS4 uses a semi-custom CPU with an integrated GPU.

http://en.wikipedia.org/wiki/PlayStation_4

Anonymous · ‎09-17-2014

Back, is there an instruction to truncate a SIMD register directly?

Unfortunately, this is the code I use to do it:

			cvtps2pi  	mm0, xmm0	
			cvtpi2ps	xmm0, mm0
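
One possible alternative, sketched here with C intrinsics: SSE2 has cvttps2dq (note the extra "t", for truncate) and cvtdq2ps, which stay in the XMM registers and handle all four floats. For comparison, cvtps2pi rounds according to the MXCSR rounding mode (round-to-nearest by default) instead of truncating, converts only the two low floats, and goes through the MMX registers. The helper name `truncate_ps` is hypothetical. SSE4.1 also offers roundps with a truncate mode, but that needs a newer CPU than SSE2.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Truncate four packed floats toward zero without leaving the XMM registers.
   This maps to cvttps2dq followed by cvtdq2ps. It is only valid while every
   value fits in a signed 32-bit integer. */
__m128 truncate_ps(__m128 v)
{
    return _mm_cvtepi32_ps(_mm_cvttps_epi32(v));
}
```

In NASM the same pair would be `cvttps2dq xmm0, xmm0` followed by `cvtdq2ps xmm0, xmm0`.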

Anonymous · ‎09-17-2014

Finally, about the bug (crash when I want more pixels): it isn't the code's fault, I was just overwriting some important BIOS data, which caused a CPU reset :/.

Now I put my data above 0x10_0000 (1 MB) of RAM :/. I guess one day I will rearrange all the fragments of the BIOS data area in RAM :o (why this fragmentation!)

And finally, following your advice, I translated my engine to C; I use SDL to initialize the video mode, and here is the result:

For 1_000_000 voxels, my engine runs at under 7 fps.

For 125_000 voxels, it's under 50 fps.

But the asm generated by GCC, without built-ins, is still full of FPU instructions. So I decided to integrate some inline assembler code (SIMD instructions) into the C, but I have a problem with it (the matrix data is not declared):

    asm ("       jmp        skip_data           \n"
         "       translation:   .long   0       \n"
         "       translation_x: .long   0       \n"
         "       translation_y: .long   0       \n"
         "       translation_z: .long   0       \n"
         "                      .long   0       \n"         //; reserved: color

         // Each entry below is one 32-bit float; trailing numbers are byte offsets
         "    angle:            .long   0       \n"
         "        rotation_x:   .long   0       \n"
         "        _xmm0:        .long   0       \n"			//; sin.x	0
         "                      .float  1.0     \n"			//; cos.x	4
         "                      .float  1.0     \n"			//; cos.x	8
         "                      .long   0       \n"			//; sin.x	12
         "                      .float  1.0     \n"			//; 1		16
         "                      .long   0       \n"			//; sin.y	20

         "        rotation_z:   .long   0       \n"
         "        _xmm1:        .long   0       \n"			//; sin.z	0
         "                      .float  1.0     \n"			//; cos.z	4
         "                      .float  1.0     \n"			//; cos.z	8
         "                      .long   0       \n"			//; sin.z	12
         "                      .float  1.0     \n"			//; 1		16
         "                      .long   0       \n"			//; sin.y	20

         "        rotation_y:   .long   0       \n"
         "        _xmm2:        .long   0       \n"			//; sin.y	0
         "                      .float  1.0     \n"			//; cos.y	4
         "                      .float  1.0     \n"			//; cos.y	8
         "                      .long   0       \n"			//; sin.y	12
         "                      .float  1.0     \n"			//; 1		16
         "                      .long   0       \n"			//; sin.y	20

         "    color_pixel:      .long   0       \n"
         "    coordonee:                        \n"
         "            x:        .long   0       \n"
         "            y:        .long   0       \n"
         "            z:        .long   0       \n"
         "                      .long   0       \n"			//; reserved: color
         "            _x:       .long   0       \n"
         "            _y:       .long   0       \n"

         "   conv_signe:        .long   0x80000000  \n"		//; -0.0f, i.e. the sign-bit mask
         "   rapport:           .float  1.3333333   \n"		//; 4:3 aspect ratio
         "    skip_data:                            \n"
			 );

    asm (
              "make_rotations:                        \n"
                //;=============
		// yaw
		//=============
                // y
				// Apply the rotation to the point	|[esi + 0] = x
				//									|[esi + 4] = y
				//									|[esi + 8] = z
				// Compute x = x.cos(phi.y) * cos(phi.z) - y.cos(phi.y) * sin(phi.z) - z.sin(phi.y)
				//
				// Compute A = x.cos(phi.y), B = y.cos(phi.y) and C = z.sin(phi.y)
		"			movups	xmm0, [_xmm2 + 4]       \n"
		"			movups	xmm1, [coordonee]       \n"
		"			mulps	xmm0, xmm1              \n"

				// Compute D = A * cos(phi.z), E = B * sin(phi.z) and C = C * 1
		"			movups	xmm1, [_xmm1 + 8]       \n"
		"			mulps	xmm0, xmm1              \n"

				// Compute F = D - E, C = C - 0
		"			hsubps	xmm0, xmm0              \n"

				// Compute xmm0 = F - C
		"			hsubps	xmm0, xmm0              \n"

				// Scale x by the x/y ratio (rapport) so that x stays proportional to y
		"			movd	xmm1, [rapport]         \n"
		"			divps	xmm0, xmm1              \n"

				// Save the new coordinate
		"			movd	[_x], xmm0              \n"

		//=============
		// / yaw
		//=============

		//=============
		// pitch
		//=============
                // x
				// Apply the rotation to the point	|[esi + 0] = x
				//									|[esi + 4] = y
				//									|[esi + 8] = z
				// Compute y = x.(cos(phi.x) * sin(phi.z) - sin(phi.x) * cos(phi.z) * sin(phi.y)) +
				//				 y.(sin(phi.x) * sin(phi.z) * sin(phi.y) + cos(phi.x) * cos(phi.z)) -
				//				 z.(sin(phi.x) * cos(phi.y))
				//
				// Compute A = cos(phi.x) * sin(phi.z), B = sin(phi.x) * cos(phi.z), E = cos(phi.x) * cos(phi.z) and F = sin(phi.x) * sin(phi.z)
		"			movddup xmm0, [_xmm0 + 8]       \n"
		"			movups 	xmm1, [_xmm1]           \n"
		"			mulps	xmm0, xmm1              \n"

				// Save xmm0 into xmm7 so that roll can reuse it as its xmm0: the equation for y resembles the equation for z,
				// except that sin(phi.y) multiplies different terms

				// Compute C' = A' * sin(phi.y) and G' = E' * sin(phi.y)
		"			movddup	xmm7, [_xmm2 + 12]       \n"
		"			mulps	xmm7, xmm0              \n"

				// Compute C = B * sin(phi.y) and G = F * sin(phi.y)
		"			movddup	xmm2, [_xmm2 + 16]      \n"
		"			mulps	xmm0, xmm2              \n"

				// movhlps copies the high half (bits 64..127) of a packed single-precision value (4 * 32 bits) into the low half (bits 0..63).
				// In short, this separates the x and y parts:	xmm0 =	A) cos(phi.x) * sin(phi.z)								xmm0 =	cos(phi.x) * sin(phi.z)
				//											 			C) sin(phi.x) * cos(phi.z) * sin(phi.y) 			=>			sin(phi.x) * sin(phi.y) * cos(phi.z)
				//														E) cos(phi.x) * cos(phi.z)								xmm1 =	cos(phi.x) * cos(phi.z)
				//														G) sin(phi.x) * sin(phi.z) * sin(phi.y)							sin(phi.x) * sin(phi.y) * sin(phi.z)
		"			movhlps xmm1, xmm0          \n"

				// Compute D = A - C
		"			hsubps xmm0, xmm0           \n"

				// Compute H = E + G
		"			haddps xmm1, xmm1           \n"

				// Compute sin(phi.x) * cos(phi.y) and cos(phi.x) * cos(phi.y)
				//
				// Compute I.roll = cos(phi.x) * cos(phi.y) and I.Pitch = sin(phi.x) * cos(phi.y)
		"			movlps		xmm3, [_xmm0 + 8]       \n"
		"			movlps		xmm2, [_xmm2 + 4]       \n"
		"			mulps		xmm2, xmm3              \n"
		"			movshdup 	xmm3, xmm2              \n"
				// Compute x.D + y.H - z.I
				//
				// Compute J = x.D, K = y.H and L = z.I
		"			movups		xmm5, [coordonee]       \n"
		"			movsldup	xmm4, xmm1              \n"    //; y.H
		"			movss		xmm4, xmm0              \n"    //; x.D
		"			movlhps 	xmm4, xmm3              \n"    //; z.I.Pitch
		"			mulps		xmm4, xmm5              \n"

				// Compute M = J + K
		"			haddps	xmm4, xmm4       \n"

				// Compute N = M - L
		"			hsubps	xmm4, xmm4       \n"

				// Save the new coordinate
		"			movd	[_y], xmm4       \n"

		//=============
		// / pitch
		//=============
		//=============
		// roll
		//=============
                // z
				// Apply the rotation to the point	|[esi + 0] = x
				//									|[esi + 4] = y
				//									|[esi + 8] = z
				// Compute z' = x.(cos(phi.x) * cos(phi.z) * sin(phi.y) + sin(phi.x) * sin(phi.z)) +
				//				  y.(sin(phi.x) * cos(phi.z) - cos(phi.x) * sin(phi.z) * sin(phi.y)) +
				//				  z.(cos(phi.x) * cos(phi.y))
				//
				// movhlps copies the high half (bits 64..127) of a packed single-precision value (4 * 32 bits) into the low half (bits 0..63).
				// In short, this separates the x and y parts:	xmm7 =	C') cos(phi.x) * sin(phi.z) * sin(phi.y)				xmm7 =	C') cos(phi.x) * sin(phi.z) * sin(phi.y)
				//											 			B') sin(phi.x) * cos(phi.z)						 =>				B') sin(phi.x) * cos(phi.z)
				//														G') cos(phi.x) * cos(phi.z) * sin(phi.y)				xmm1 =	G') cos(phi.x) * cos(phi.z) * sin(phi.y)
				//														F') sin(phi.x) * sin(phi.z)										F') sin(phi.x) * sin(phi.z)
		"			movhlps xmm1, xmm7          \n"

				// Compute D' = -B' + C'
		"			movd	xmm6, [conv_signe]  \n"
		"			orps	xmm7, xmm6          \n"
		"			haddps	xmm7, xmm7          \n"

				// Compute H' = G' + F'
		"			haddps	xmm1, xmm1          \n"

				// Compute x.D' + y.H' + z.I'
				//
				// Compute J = x.D', K = y.H' and L = z.I'
		"			movups		xmm3, [coordonee]       \n"
		"			movsldup	xmm4, xmm7              \n"    // y.D'
		"			movss		xmm4, xmm1              \n"    // x.H'
		"			movlhps 	xmm4, xmm2              \n"    // z.I'
		"			mulps		xmm4, xmm3              \n"

				// Compute M' = J' + K'
		"			haddps	xmm4, xmm4       \n"

				// Compute N' = M' + L'
		"			haddps	xmm4, xmm4       \n"
		//=============
		// / roll
		//=============
			 );

I debugged this manually and found that only the roll block runs normally; the others make my program crash, strange (return code 0x3). I guess I will try compiling it with the Intel compiler.
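
One possible explanation, untested here: the yaw and pitch blocks each end with a store (to [_x] and [_y]), and those labels were emitted in the middle of the code, which is normally mapped read-only, whereas the roll block never writes to memory. A way to sidestep both this and the GAS syntax is to keep the data in ordinary C arrays and use SSE intrinsics. The sketch below is a simplification, not a translation of the code above: it rotates four points about the Z axis only, and the names `points4` and `rotate_z4` are hypothetical.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Four points in SoA form, stored in ordinary (writable) C arrays. */
struct points4 {
    float x[4];
    float y[4];
    float z[4];
};

/* Rotate four points about the Z axis, given the precomputed cosine and
   sine of the angle:  x' = x*cos - y*sin,  y' = x*sin + y*cos,  z' = z. */
void rotate_z4(struct points4 *p, float cos_a, float sin_a)
{
    __m128 c = _mm_set1_ps(cos_a);
    __m128 s = _mm_set1_ps(sin_a);
    __m128 x = _mm_loadu_ps(p->x);
    __m128 y = _mm_loadu_ps(p->y);

    __m128 new_x = _mm_sub_ps(_mm_mul_ps(x, c), _mm_mul_ps(y, s));
    __m128 new_y = _mm_add_ps(_mm_mul_ps(x, s), _mm_mul_ps(y, c));

    _mm_storeu_ps(p->x, new_x);
    _mm_storeu_ps(p->y, new_y);
}
```

On 32-bit GCC, building with `-msse2 -mfpmath=sse` should also make the compiler emit SSE instructions instead of x87 ones for scalar float code.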

Do you know any IDE with icc? I use Code::Blocks, but it seems complicated to configure a compiler other than gcc, and honestly I don't like how gcc dictates the way I write assembly (dd -> .long, can't write (variable) (type) (value), need to put ':', and other things I guess). Just details, but still very annoying if I write a lot of inline asm :/

When I clear the screen with SDL_FillRect(screen,NULL,0);, it's very fast and I don't see the function executing; it is very fast compared to mine:

	;=============
	 ; void clear_screen (void)
	 ; Clear screen
	 ; Input: None
	 ; Output: Screen
	 ; Destroyed: edi, ecx, xmm0
	;=============	 
	clear_screen:
		mov		edi, [PhysBasePtr]
		mov		ecx, (WIDTH*LENGTH*4)/16 	;	mov		ecx, (WIDTH*LENGTH)/8
		xorps	xmm0, xmm0					;	vxorps	ymm1, ymm1		; 256 bit instruction !
		clear_s:
			movdqu 	[edi], xmm0				;	vmovapd	[edi], ymm1		; 256 bit instruction !
			add		edi, 16					;	add		edi, 32
		loop	clear_s
	ret
	;===============
	; / clear_screen
	;===============

Does SDL interact with the GPU to clear the screen? I'm lost.
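
A possible reason for the difference, offered as an assumption and not as a measured result: as far as I know, with a software surface SDL_FillRect fills a buffer in ordinary system RAM, and the pixels only reach the screen on SDL_Flip or SDL_UpdateRect. The routine above writes straight into the linear framebuffer, which is often much slower to write to than cached RAM, and it uses the `loop` instruction, which is slow on many Intel CPUs. The sketch below shows the back-buffer idea in C; `back_buffer`, `present` and the `framebuffer` parameter are hypothetical names, and on a bare-metal kernel memset and memcpy would have to be your own routines.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SCREEN_W 800
#define SCREEN_H 600

/* Off-screen frame kept in ordinary cached RAM (32 bits per pixel). */
static uint32_t back_buffer[SCREEN_W * SCREEN_H];

/* Clear the off-screen frame; nothing is visible on screen yet. */
void clear_back_buffer(void)
{
    memset(back_buffer, 0, sizeof back_buffer);
}

/* Copy the finished frame to video memory in one linear pass.
   `framebuffer` stands for the address stored in PhysBasePtr. */
void present(uint32_t *framebuffer)
{
    memcpy(framebuffer, back_buffer, sizeof back_buffer);
}
```

Because the clear and all drawing happen off screen, the viewer only ever sees complete frames, which also removes the visible clearing.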

Here is my code for counting FPS: is it good, or can I do it faster?

        start_time = SDL_GetTicks();
    while (1)
    {
         ....
        calculate_fps:
            current_time = SDL_GetTicks();
            if (current_time - start_time >= 1000)
            {
                fps = compteur_boucle;
                compteur_boucle = 0;
                start_time = SDL_GetTicks();
            }
            else
                compteur_boucle++;
            printf("FPS = %d\n", fps);
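
A possible variant, as a sketch only: in the loop above, the frame on which the second elapses is not counted (the increment sits in the else branch), and printf runs on every frame, which itself costs time. The version below counts every frame and signals when a new value is ready, so the caller can print once per second. The struct and function names are hypothetical; `now_ms` would come from SDL_GetTicks().

```c
#include <stdint.h>

struct fps_counter {
    uint32_t window_start_ms;  /* start of the current one-second window */
    uint32_t frames;           /* frames counted in the current window */
    uint32_t fps;              /* result for the last completed window */
};

/* Call once per frame with the current time in milliseconds.
   Returns 1 when a new fps value has just been produced, 0 otherwise. */
int fps_counter_tick(struct fps_counter *c, uint32_t now_ms)
{
    c->frames++;
    if (now_ms - c->window_start_ms >= 1000u) {
        c->fps = c->frames;
        c->frames = 0;
        c->window_start_ms = now_ms;   /* reuse the time already read */
        return 1;
    }
    return 0;
}
```

In the main loop this would be used as `if (fps_counter_tick(&counter, SDL_GetTicks())) printf("FPS = %u\n", (unsigned)counter.fps);`.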

Thanks

PS: sorry about the comments, I wrote them in French and I'm just too lazy to translate them ^^