Solved: 3D engine

Anonymous · ‎09-02-2014

Hello there,

OK, here we go: I have a dream of making a 3D engine written 100% in Intel assembly that runs on the CPU only. For now I only use rotation matrices.

It works, of course, but it is slow when I draw a lot of pixels.

Recently I decided to add voxels to my engine, and it becomes slow once I have >= 8000 voxels (a 20 * 20 * 20 cube). When I saw that NVIDIA can display 32M voxels (fire), I wondered how they do it!

I have a rough idea of the reason: the MMU, paging, segmentation, memory.

Am I right?

Another question: is the FPU slower than SSE at floating-point computation, or does it depend on the data being manipulated?

PS: I work without an OS such as Windows or Linux. I run on my own kernel and bootloader, both written in assembly with NASM.

Sorry if my English is not good; I'm French and I use Google Translate ^-^

Anonymous · ‎09-02-2014

I also draw pixels through a pointer to the VESA LFB (linear frame buffer).

Another question: I don't think I understand what the GPU inside an Intel Core CPU is. I have this processor: http://ark.intel.com/products/53464/intel-core-i7-2640m-processor-4m-cac....

I also have an AMD Radeon HD 6470M as my graphics card, so can I theoretically switch GPUs? If yes, how do I do it ^-^

Because I can't do it through software: Windows XP :(

Finally, I want to talk about multitasking. I know modern OSes use memory paging for it, but honestly I hate this fragmentation of memory (RAM), so can I use my second processor as a code selector?

I mean, could this second processor have its own memory, filled with the physical RAM addresses of my programs/tasks? Every second, for example, this second processor would move on to the next task's address, and to execute it, it would tell the main processor to jump to that address.

I know that, if this is possible, each task would need its own area in RAM, so I would delimit it with ADDR_TASK_START and ADDR_TASK_END as segmentation does, but I think this is more understandable and easier for learning multitasking than using the whole segmentation technology ^-^
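
The task table described above can be sketched in C as a simple round-robin selector. This is only an illustration under assumptions: `task_t`, `scheduler_t`, `scheduler_next` and `task_owns` are hypothetical names, and on a single CPU the periodic timer interrupt would typically play the role imagined here for the second processor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*task_entry)(void);

/* Hypothetical task descriptor (not from the original post). */
typedef struct {
    task_entry entry;      /* address the CPU jumps to            */
    uint32_t   addr_start; /* ADDR_TASK_START: first owned byte   */
    uint32_t   addr_end;   /* ADDR_TASK_END: last owned byte      */
} task_t;

/* Hypothetical scheduler state: a fixed table plus a cursor. */
typedef struct {
    const task_t *tasks;
    size_t        count;
    size_t        current;
} scheduler_t;

/* Returns the task to run now and advances the round-robin cursor.
   On a real kernel this would be called from the timer interrupt handler. */
static const task_t *scheduler_next(scheduler_t *s)
{
    assert(s != NULL && s->count > 0);
    const task_t *t = &s->tasks[s->current];
    s->current = (s->current + 1) % s->count;
    return t;
}

/* True when addr lies inside the memory range reserved for the task. */
static int task_owns(const task_t *t, uint32_t addr)
{
    return addr >= t->addr_start && addr <= t->addr_end;
}
```

A real switch would also have to save and restore the registers of the interrupted task; the sketch only covers choosing the next address.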

I want to be master of all my RAM :p

What do you think, is it possible?

Bradley_W_Intel · ‎09-02-2014

You clearly are using the processor in a very advanced way. I will do my best to answer your questions:

1) Why is your voxel engine not able to efficiently render as many voxels as you'd like? Voxel engines need to maximize their use of parallelism (both threading and SIMD) and also to store the data efficiently in an octree or some other structure that can handle sparse data. If you are doing all these things and still not getting the performance you expect, it's an optimization problem. Some Intel tools like VTune Performance Analyzer are excellent for performance analysis.

2) Is single data floating point math faster than SIMD (if I understood you)? Typically SIMD will be faster than single data instructions if your data is laid out in a way that supports the SIMD calls. In all cases, the only way for you to know for certain which way is faster is to test it.

3) How can you select between discrete and processor graphics? DirectX has methods of enumerating adapters. In such a case, the processor graphics is listed separately from the discrete graphics. If you are choosing your adapter based on the amount of available memory, you may be favoring the processor graphics when you didn't intend to. Intel has sample code that shows how to properly detect adapters in DirectX at https://software.intel.com/en-us/vcsource/samples/gpu-detect. The process for OpenGL is not well documented.

4) Can I use one processor to control execution of a second processor? Probably not. The details on Intel processors are covered at http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html. It's possible, though unlikely, that you'll be able to find something in there that can help you.
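
To illustrate point 2, here is a minimal sketch of what "data laid out in a way that supports SIMD" can look like: the x and y coordinates live in separate arrays, so four points are rotated about the Z axis per SSE instruction (z is unchanged by this rotation). The function names are hypothetical, the SSE path is only compiled when the compiler defines `__SSE__`, and whether it is actually faster on a given CPU has to be measured.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Scalar reference: rotate n points about Z, given c = cos(a), s = sin(a). */
static void rotate_z_scalar(float *x, float *y, size_t n, float c, float s)
{
    for (size_t i = 0; i < n; i++) {
        float xr = x[i] * c - y[i] * s;
        float yr = x[i] * s + y[i] * c;
        x[i] = xr;
        y[i] = yr;
    }
}

/* Same rotation, four points per iteration when SSE is available. */
static void rotate_z_simd(float *x, float *y, size_t n, float c, float s)
{
    size_t i = 0;
#if defined(__SSE__)
    const __m128 vc = _mm_set1_ps(c);
    const __m128 vs = _mm_set1_ps(s);
    for (; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);   /* unaligned loads: any address works */
        __m128 vy = _mm_loadu_ps(y + i);
        __m128 xr = _mm_sub_ps(_mm_mul_ps(vx, vc), _mm_mul_ps(vy, vs));
        __m128 yr = _mm_add_ps(_mm_mul_ps(vx, vs), _mm_mul_ps(vy, vc));
        _mm_storeu_ps(x + i, xr);
        _mm_storeu_ps(y + i, yr);
    }
#endif
    /* Remaining points (or all of them without SSE) use the scalar path. */
    rotate_z_scalar(x + i, y + i, n - i, c, s);
}
```

With an array of `{x, y, z, color}` records instead, the same four x values would not be contiguous in memory, which is why the layout matters.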

Anonymous · ‎09-02-2014

Hello,

Thanks for your answer. You are probably right, it's an optimization problem; I will add AVX/AVX2 instructions later.

When you talk about discrete and processor graphics and enumerating adapters, what do you mean? I'm a beginner in this area.

Bye

Anonymous · ‎09-02-2014

Also, for my voxel technique I work on individual pixels, not on blocks as with an octree or any other structure.

I mean, I create a cube filled with pixels/voxels in an array, then I display this cube through a rotation matrix.

Bradley_W_Intel · ‎09-02-2014

Processor graphics = the Intel graphics that are part of your processor. Discrete graphics = your Radeon card.

Anonymous · ‎09-02-2014

(pixel = 3D coordinate)

Bernard · ‎09-03-2014

>>>I want to be master of all my RAM :p>>>

I fear that this is not really possible, because the OS manages the resources. You can of course try to allocate as much physical and virtual memory as possible until your program crashes.

Bernard · ‎09-03-2014

>>>and when I saw that nvidia display 32M voxels (fire) I wonder how they can do it !>>>

Probably by offloading the job to the GPU.

Anonymous · ‎09-03-2014

Hi,

I know, that's why I run my little 3D engine on my own OS. It's just a basic kernel that sets a VESA mode, so I can use the LFB to draw pixels on screen.

Anonymous · ‎09-03-2014

Finally, regarding optimization, I forgot to tell you that I use an eMachines 350 with an Intel® Atom™ Processor N450 (512K Cache, 1.66 GHz) for testing :D

That explains a little why it is slow. I tested it on another PC with an Intel Celeron and it is better ^-^

Anonymous · ‎09-04-2014

Hello,

OK, finally I have thought of a project that uses voxels. Do you think it is possible?

Project Atom engine

Steps:
       - Manage a lot of pixels (~1 billion, HD): currently I can't manage more than 40 * 40 * 40 coords/voxels/pixels/atoms.
       - Create objects through an object editor; a 3D printer could be used to speed up the process.
       - Attach all the properties of an atom to a coord/voxel/pixel.
       - Finally, put all those coords/voxels/pixels/atoms in a video game.

Bernard · ‎09-09-2014

>>> - manage a lot of pixel, ( ~milliard HD) : actually can't manage more 40 * 40 * 40 coord/voxel/pixel/atom>>>

You probably meant total bandwidth per second.

Anonymous · ‎09-10-2014

Well, normally yes, but in my case my program crashes when I want more pixels ^^

Bernard · ‎09-11-2014

shaynox s. wrote:

well, normally yes but for me, my program crash when i want more pixel ^^

Did you analyse the dump file with GDB? You probably have a segmentation error.

Anonymous · ‎09-11-2014

I can't, I use NASM for it.

But I have stopped working on my 3D engine, because this memory management is driving me crazy ^^

Maybe I will read the whole Intel® 64 and IA-32 Architectures Software Developer’s Manual, Combined Volumes 1, 2A, 2B, 2C, 3A, 3B, and 3C, to find out whether there is a hack to avoid it.

And you are probably right. I create my cube like this:

		mov		esi, Cube2_game
		call	voxel_cube
		mov		[end_scene], esi

	%define		cote_voxel_cube		105
	voxel_cube:
		mov		edx, 0xFF0000FF
		mov		eax, cote_voxel_cube
		fill_cube_volume:
			mov		ebx,  cote_voxel_cube 
			fill_cube_surface:
				mov		ecx,  cote_voxel_cube
				fill_cube_line: 
					mov			[esi + 0], eax		; x
					mov			[esi + 4], ebx		; y
					mov			[esi + 8], ecx		; z
					movdqu		xmm0, [esi]
						cvtdq2ps	xmm0, xmm0		; Convert the integer coordinates to float
					movdqu		[esi], xmm0
					mov			[esi + 12],	edx		; color_pixel
					add		esi, 16
				sub		ecx, 7
				cmp		ecx, -cote_voxel_cube
				jnle	fill_cube_line
			sub		ebx, 7
			cmp		ebx, -cote_voxel_cube
			jnle	fill_cube_surface
		sub		eax, 7
		cmp		eax, -cote_voxel_cube
		jnle	fill_cube_volume
	ret
	;=====================
	; / void voxel_cube ()
	;=====================

The label "Cube2_game:" is located at the end of my kernel.asm. It works a little like malloc: my voxel cube overwrites whatever data comes after the kernel (loaded at [ORG 0x1000]). I know the BIOS likes to keep some of its data at arbitrary locations, so I guess that is the cause.
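
One way to check that guess is to work out how much memory the cube needs. The sketch below mirrors the loop bounds of voxel_cube (start at `side`, subtract `step`, continue while the counter is greater than `-side`) and the 16-byte record it writes. It assumes the buffer lives in conventional memory starting no lower than 0x1000; the real start is the kernel's end, which is not given here. On a PC, the range 0xA0000 to 0xFFFFF is used by video memory and ROM, so a buffer that grows past 0xA0000 would write into it. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define VOXEL_BYTES 16u  /* x, y, z as floats + 32-bit color, as in voxel_cube */

/* Iterations of one loop level of voxel_cube: the counter starts at `side`,
   decreases by `step`, and the loop repeats while the counter is > -side. */
static uint32_t voxels_per_axis(int32_t side, int32_t step)
{
    assert(side > 0 && step > 0);
    uint32_t n = 0;
    for (int32_t v = side; v > -side; v -= step)
        n++;
    return n;
}

/* Total bytes written for a cube with `per_axis` voxels along each axis. */
static uint64_t cube_bytes(uint32_t per_axis)
{
    return (uint64_t)per_axis * per_axis * per_axis * VOXEL_BYTES;
}

/* True when a buffer of `bytes` starting at `base` ends at or below `limit`. */
static int fits_below(uint64_t base, uint64_t bytes, uint64_t limit)
{
    return base + bytes <= limit;
}
```

With cote_voxel_cube = 105 and a step of 7 this gives 30 voxels per axis, 27000 voxels and 432000 bytes, which still fits below 0xA0000 when starting at 0x1000. A 40 * 40 * 40 cube needs 1024000 bytes and does not fit, which would be consistent with the crash mentioned earlier, although only a look at the actual memory map can confirm it.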

	end_scene	dd	0
	Cube2_game:

I can post my bootloader and kernel here, but some comments are in French; I can translate them if you want (approximate English!).

Finally, can you explain to me how aligned/unaligned data works?

Previously I used movdqa instead of movdqu in this function:

	;=============
	 ; void clear_screen (void)
	 ; Clears the screen by filling the linear frame buffer with zeros
	 ; Input : none (reads the frame buffer address from [PhysBasePtr])
	 ; Output: screen
	 ; Destroyed: ecx - edi - xmm0
	;=============
	clear_screen:
		mov		edi, [PhysBasePtr]
		mov		ecx, (WIDTH*LENGTH*4)/16 	;	mov		ecx, (WIDTH*LENGTH)/8
		xorps	xmm0, xmm0					;	vxorps	ymm1, ymm1		; 256 bit instruction !
		clear_s:
			movdqu 	[edi], xmm0				;	vmovapd	[edi], ymm1		; 256 bit instruction !
			add		edi, 16					;	add		edi, 32
		loop	clear_s
	ret
	;===============
	; / clear_screen
	;===============

And it was slower than movdqu. I don't get it: in my mind this instruction just copies the source (xmm0) to the destination ([edi]).

Strange. And yes, even though movdqu is faster, it is still slow: I can see this function executing on screen.

And what about a back buffer, you will ask? Well, I don't think I need one, because the linear frame buffer is already a buffer, and if I have display problems with one buffer, I can't imagine what it would be like drawing with three.
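
For comparison, this is roughly what the back-buffer approach looks like: everything is drawn into ordinary RAM, and the finished frame is copied to the frame buffer once, row by row. It is a sketch under assumptions, not a claim about this engine: `backbuffer_t` and the `bb_*` functions are hypothetical, pixels are assumed to be 32 bits, and `pitch_bytes` stands for the BytesPerScanLine value reported by VBE. Video memory is often much slower to access than normal RAM, so this can help, but it has to be measured on the target machine.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical back buffer living in ordinary RAM. */
typedef struct {
    uint32_t *pixels;  /* width * height pixels, 32 bits each */
    uint32_t  width;
    uint32_t  height;
} backbuffer_t;

/* Fills the whole back buffer with one color. */
static void bb_clear(backbuffer_t *bb, uint32_t color)
{
    for (size_t i = 0; i < (size_t)bb->width * bb->height; i++)
        bb->pixels[i] = color;
}

/* Writes one pixel; coordinates outside the buffer are ignored. */
static void bb_put_pixel(backbuffer_t *bb, uint32_t x, uint32_t y, uint32_t color)
{
    if (x < bb->width && y < bb->height)
        bb->pixels[(size_t)y * bb->width + x] = color;
}

/* Copies the finished frame to the frame buffer, one scan line at a time,
   because a scan line in video memory may be longer than width * 4 bytes. */
static void bb_present(const backbuffer_t *bb, uint8_t *lfb, uint32_t pitch_bytes)
{
    assert(pitch_bytes >= bb->width * 4u);
    for (uint32_t y = 0; y < bb->height; y++)
        memcpy(lfb + (size_t)y * pitch_bytes,
               bb->pixels + (size_t)y * bb->width,
               (size_t)bb->width * 4u);
}
```

Only `bb_present` touches video memory, and it does so with sequential writes, so partially drawn frames are never visible on screen.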

PS: Can you add a new code tag for assembly? Thanks.

Bernard · ‎09-11-2014

>>>But i stoped my 3D engine, cause to those memory managment, make me crazy>>>

Why don't you try to develop a 3D engine in C++ for Windows or Linux? I am also at an early stage of developing my own engine, but I am writing it in C++ and inline assembly.

Bernard · ‎09-11-2014

>>>Finnaly can you explain me, how work aligned/unligned data ?>>>

Please read the following article: https://software.intel.com/en-us/articles/reducing-the-impact-of-misaligned-memory-accesses
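
In short, an address is aligned to 16 bytes when it is a multiple of 16. movdqa requires a 16-byte aligned memory operand and raises a general-protection fault otherwise, while movdqu accepts any address. A minimal sketch of the arithmetic, with hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>

/* True when p is a multiple of `alignment` (which must be a power of two). */
static int is_aligned(const void *p, uintptr_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* Rounds addr up to the next multiple of `alignment` (a power of two). */
static uintptr_t align_up(uintptr_t addr, uintptr_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (addr + alignment - 1) & ~(alignment - 1);
}
```

In NASM, placing `align 16` before a data label gives the same guarantee for static data.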

Bernard · ‎09-11-2014

@shaynox

How do you call the VESA API?

Anonymous · ‎09-11-2014

"Why do not you try to develop 3D engine in C++ for Windows or Linux?"

Because I would like to run a game without an OS, or on my own (in development), and without DirectX or OpenGL. It sounds like a challenge, and I like to be free (I love assembler), at least until I ran into the paging mechanism.

As for the VESA API, I just mean the function for finding the LFB pointer:

	ModeInfoBlock:
		; Mandatory information for all VBE revisions
			ModeAttributes 				dw 	0 			; mode attributes
			WinAAttributes 				db 	0 			; window A attributes
			WinBAttributes 				db 	0 			; window B attributes
			WinGranularity 				dw 	0 			; window granularity
			WinSize 					dw 	0 			; window size
			WinASegment 				dw 	0 			; window A start segment
			WinBSegment 				dw 	0 			; window B start segment
			WinFuncPtr 					dd 	0 			; real mode pointer to window function
			BytesPerScanLine 			dw 	0 			; bytes per scan line
		
		; Mandatory information for VBE 1.2 and above
			XResolution 				dw 	0 			; horizontal resolution in pixels or characters
			YResolution 				dw 	0 			; vertical resolution in pixels or characters
			XCharSize 					db 	0 			; character cell width in pixels
			YCharSize 					db 	0 			; character cell height in pixels
			NumberOfPlanes 				db 	0 			; number of memory planes
			BitsPerPixel 				db 	0 			; bits per pixel
			NumberOfBanks 				db 	0 			; number of banks
			MemoryModel 				db 	0 			; memory model type
			BankSize 					db	0 			; bank size in KB
			NumberOfImagePages 			db 	0 			; number of images
			Reserved 					db 	1 			; reserved for page function
		
		; Direct Color fields (required for direct/6 and YUV/7 memory models)
			RedMaskSize 				db 	0 			; size of direct color red mask in bits
			RedFieldPosition			db 	0 			; bit position of lsb of red mask
			GreenMaskSize 				db 	0 			; size of direct color green mask in bits
			GreenFieldPosition 			db 	0 			; bit position of lsb of green mask
			BlueMaskSize 				db 	0 			; size of direct color blue mask in bits
			BlueFieldPosition 			db 	0 			; bit position of lsb of blue mask
			RsvdMaskSize 				db 	0 			; size of direct color reserved mask in bits
			RsvdFieldPosition 			db 	0 			; bit position of lsb of reserved mask
			DirectColorModeInfo 		db 	0 			; direct color mode attributes
		
		; Mandatory information for VBE 2.0 and above
			PhysBasePtr 				dd 	0 			; physical address for flat memory frame buffer
										dd 	0 			; Reserved - always set to 0
										dw 	0 			; Reserved - always set to 0
		
		; Mandatory information for VBE 3.0 and above
			LinBytesPerScanLine 		dw 	0 			; bytes per scan line for linear modes
			BnkNumberOfImagePages 		db 	0 			; number of images for banked modes
			LinNumberOfImagePages 		db 	0 			; number of images for linear modes
			LinRedMaskSize 				db 	0			; size of direct color red mask (linear modes)
			LinRedFieldPosition			db 	0 			; bit position of lsb of red mask (linear modes)
			LinGreenMaskSize 			db 	0 			; size of direct color green mask (linear modes)
			LinGreenFieldPosition 		db	0 			; bit position of lsb of green mask (linear modes)
			LinBlueMaskSize 			db 	0 			; size of direct color blue mask (linear modes)
			LinBlueFieldPosition 		db 	0 			; bit position of lsb of blue mask (linear modes)
			LinRsvdMaskSize 			db 	0 			; size of direct color reserved mask (linear modes)
			LinRsvdFieldPosition 		db 	0 			; bit position of lsb of reserved mask (linear modes)
			MaxPixelClock 				dd 	0 			; maximum pixel clock (in Hz) for graphics mode
			times	189					db 	0			; reserved area remainder of ModeInfoBlock

				; Return VBE Mode Information
					mov		ax, 0x4F01
					mov		cx, 0x115           ; Mode number = 800 * 600
					mov		di, ModeInfoBlock   ; Pointer to ModeInfoBlock structure
					int		0x10

				; Set VBE Mode
					mov		ax, 0x4F02
					mov		bx, 0x4115           ; ( for 800 x 600 )
											; D0-D8		=  		Mode number 
											; D9-D10 	= 0		Reserved (must be 0)
											; D11 		= 0 	Use current default refresh rate
											; 			= 1  	Use user specified CRTC values for refresh rate
											; D12-13 	= 0 	Reserved for VBE/AF (must be 0)
											; D14 		= 0 	Use windowed frame buffer model
											; 			= 1		Use linear/flat frame buffer model
											; D15  		= 0 	Clear display memory
											; 			= 1		Don't clear display memory
					int		0x10

				mov		edi, [PhysBasePtr]
				mov		dword [edi], 0xFF_FF_FF_FF     ; 0xAA_RR_GG_BB white (NASM needs the dword size here)
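
The value 0x4115 passed in BX follows directly from the bit table in the comment above: the mode number sits in D0-D8, D14 selects the linear frame buffer, and D15 asks the BIOS not to clear display memory. A small sketch of that composition, with hypothetical macro and function names:

```c
#include <assert.h>
#include <stdint.h>

#define VBE_MODE_NUMBER_MASK 0x01FFu      /* D0-D8: mode number             */
#define VBE_USE_LFB          (1u << 14)   /* D14: linear/flat frame buffer  */
#define VBE_DONT_CLEAR       (1u << 15)   /* D15: don't clear display memory */

/* Builds the BX value for VBE function 0x4F02 (Set VBE Mode). */
static uint16_t vbe_mode_word(uint16_t mode, int use_lfb, int keep_memory)
{
    uint32_t w = mode & VBE_MODE_NUMBER_MASK;
    if (use_lfb)
        w |= VBE_USE_LFB;
    if (keep_memory)
        w |= VBE_DONT_CLEAR;
    return (uint16_t)w;
}
```

For mode 0x115 with the linear frame buffer enabled, `vbe_mode_word(0x115, 1, 0)` yields 0x4115, the constant used in the snippet.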

Anonymous · ‎09-11-2014

As for the LFB pointer, I don't understand why this formula is right; it is my own and it works:

		; Algo: LFB = 0x40000000+((RAM_SIZE_GB-1)*0x40000000)
			;
			; If RAM == 1 GB { LFB = 0x040000000 }
			;									  + 0x40000000
			; If RAM == 2 GB { LFB = 0x080000000 }
			;									  + 0x40000000
			; If RAM == 3 GB { LFB = 0x0C0000000 }
			;									  + 0x40000000
			; If RAM == 4 GB { LFB = 0x100000000 }
			;									  + 0x40000000
			; If RAM == 5 GB { LFB = 0x140000000 }

In summary, the LFB always starts at the upper limit of my RAM oO
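
The formula simplifies to RAM size in GB times 0x40000000 (1 GiB), which is the first address past the end of RAM. The sketch below just reproduces that arithmetic; `lfb_guess` is a hypothetical name, and the result is only the observation made on these machines. Note that from 4 GB upward the result no longer fits in the 32-bit PhysBasePtr field, so the value returned in PhysBasePtr by function 0x4F01 remains the reference.

```c
#include <assert.h>
#include <stdint.h>

#define GIB 0x40000000ull  /* 1 GiB */

/* Reproduces the formula from the post:
   LFB = 0x40000000 + (RAM_SIZE_GB - 1) * 0x40000000 */
static uint64_t lfb_guess(uint32_t ram_gb)
{
    assert(ram_gb >= 1);
    return GIB + (uint64_t)(ram_gb - 1) * GIB;
}
```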