Solved: >>> vmovups XMMWORD PTR [48 - Page 5

Anonymous · ‎09-02-2014

Hello there,

OK, here we go: my dream is to write a 3D engine 100% in Intel assembler, running on the CPU only. For now I only use rotation matrices.

It works, of course, but it gets slow when I draw a lot of pixels.

Recently I decided to add voxels to my engine, and it gets slow at >= 8000 voxels (a 20 * 20 * 20 cube). When I saw NVIDIA display 32M voxels (fire), I wondered how they do it!

I have a rough idea of the reason: MMU, paging, segmentation, memory.

Am I right?

Another question: is the FPU slower than SSE at floating-point computation, or does it depend on the data being manipulated?

PS: I work without an OS such as Windows or Linux; I run on my own kernel and bootloader, also written in assembly with NASM.

Sorry if I don't write good English; I'm French and use Google Translate ^-^

Bradley_W_Intel · ‎09-02-2014

You clearly are using the processor in a very advanced way. I will do my best to answer your questions:

1) Why is your voxel engine not able to efficiently render as many voxels as you'd like? Voxel engines need to maximize their use of parallelism (both threading and SIMD) and also to store the data efficiently in an octree or some other structure that can handle sparse data. If you are doing all these things and still not getting the performance you expect, it's an optimization problem. Some Intel tools like VTune Performance Analyzer are excellent for performance analysis.

2) Is single data floating point math faster than SIMD (if I understood you)? Typically SIMD will be faster than single data instructions if your data is laid out in a way that supports the SIMD calls. In all cases, the only way for you to know for certain which way is faster is to test it.

3) How can you select between discrete and processor graphics? DirectX has methods of enumerating adapters. In such a case, the processor graphics is listed separately from the discrete graphics. If you are choosing your adapter based on the amount of available memory, you may be favoring the processor graphics when you didn't intend to. Intel has sample code that shows how to properly detect adapters in DirectX at https://software.intel.com/en-us/vcsource/samples/gpu-detect. The process for OpenGL is not well documented.

4) Can I use one processor to control execution of a second processor? Probably not. The details on Intel processors are covered at http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html. It's possible, though unlikely, that you'll be able to find something in there that can help you.


Bernard · ‎09-24-2014

>>>can't put avx2, cause i don't have required procesor>>>

Do you have a Core i7 Ivy Bridge CPU?

Your code should benefit from the AVX2 ISA on a Haswell CPU, mainly because of the FMA units on Port 0 and Port 1. If the compiler can emit FMA instructions, you could speed up polynomial-like code that contains additions and multiplications, reaching up to 16 DP FLOPs/cycle/core.

result = (a + b) * (c + d)
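
An FMA instruction fuses one multiply and one add of the form a * b + c, so polynomial evaluation with Horner's rule is a natural fit. The sketch below is only an illustration: the function name horner and its parameters are hypothetical, and whether the compiler really emits FMA for the loop body depends on the target CPU and the compiler flags.

```c
#include <stddef.h>

/* Evaluate c[0] + c[1]*x + ... + c[n-1]*x^(n-1) with Horner's rule.
 * Each step has the shape acc * x + c, which is the pattern a single
 * FMA instruction computes. Whether FMA is actually emitted depends on
 * the target and the compiler flags (not checked here). */
float horner(const float *coeffs, size_t n, float x)
{
    float acc = 0.0f;
    for (size_t i = n; i-- > 0; ) {
        acc = acc * x + coeffs[i];
    }
    return acc;
}
```

For example, with coefficients {1, 2, 3} and x = 2 the loop computes ((3 * 2) + 2) * 2 + 1.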

Bernard · ‎09-24-2014

I will test your code later on my Core i5 Haswell machine.

Anonymous · ‎09-24-2014

AoS : fail :/

When you talk about aligned data, do you mean the way data is stored in RAM? I know high-level languages fragment data, and in assembler we always work with aligned data. I tried making it static, like my object (rhino), but I still get mulss :/

But I'm curious what the asm code looks like if I calculate the rotation matrix without unrolling the matrix.

How do you also align data on a 4 KB page boundary?

Anonymous · ‎09-24-2014

A "restrict" pointer is like a static array, no?

With the Intel compiler I get 210 fps :o

Bernard · ‎09-24-2014

>>>When you talk about aligned data, do you mean the way how data are store in RAM >>>

Yes. Usually the compiler will arrange an array linearly in memory, as opposed to objects allocated on the heap or the nodes of a linked list, which of course are allocated at runtime.
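
On the 4 KB page-boundary question, one option in a hosted C11 environment is aligned_alloc. This is a minimal sketch under that assumption: the helper name alloc_page_aligned and the PAGE_BYTES constant are hypothetical, and a freestanding kernel without a C library would need its own allocator instead.

```c
#include <stdint.h>
#include <stdlib.h>

#define PAGE_BYTES 4096u  /* assumed 4 KB page size */

/* Allocate at least `bytes` bytes starting on a 4 KB boundary.
 * C11 aligned_alloc expects the size to be a multiple of the alignment,
 * so the request is rounded up. Release the block with free(). */
void *alloc_page_aligned(size_t bytes)
{
    size_t rounded = (bytes + PAGE_BYTES - 1) / PAGE_BYTES * PAGE_BYTES;
    if (rounded == 0) {
        rounded = PAGE_BYTES;
    }
    return aligned_alloc(PAGE_BYTES, rounded);
}
```

The returned address can be checked by casting it to uintptr_t and testing that it is a multiple of 4096.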

Anonymous · ‎09-24-2014

I'm on a laptop with a Core i7 Sandy Bridge.

Anonymous · ‎09-24-2014

I have reached (good word? ^^) 250+ fps by testing some Intel compiler options, although the AVX instructions use XMM registers and only the first 32 bits for the computation:

;;; 	static float coord[3];
;;; 	static float end_coord[3];
;;; 
;;; 	static float cosX;
;;; 	static float cosY;
;;; 	static float cosZ;
;;; 	static float sinX;
;;; 	static float sinY;
;;; 	static float sinZ;
;;; 	int     offset_pixel;
;;; 
;;; 	cosX = cos(DEG2RAD(rotation_object[0]));

        vxorpd    xmm3, xmm3, xmm3                              ;39.9
$LN3:
        vcvtss2sd xmm3, xmm3, DWORD PTR [rotation_object]       ;39.9
$LN4:
        vmovups   XMMWORD PTR [32+rsp], xmm15                   ;27.1
$LN5:
        mov       r14d, r9d                                     ;27.1
$LN6:
        vmovsd    xmm15, QWORD PTR [_2il0floatpacket.0]         ;39.13
$LN7:
        vmovups   XMMWORD PTR [112+rsp], xmm11                  ;27.1
$LN8:
        vmovaps   xmm11, xmm2                                   ;27.1
$LN9:
        vmovups   XMMWORD PTR [128+rsp], xmm10                  ;27.1
$LN10:
        vmovaps   xmm10, xmm1                                   ;27.1
$LN11:
        vmovups   XMMWORD PTR [144+rsp], xmm9                   ;27.1
$LN12:
        vmovaps   xmm9, xmm0                                    ;27.1
$LN13:
        vmulsd    xmm0, xmm3, xmm15                             ;39.9
$LN14:
        vmovups   XMMWORD PTR [48+rsp], xmm14                   ;27.1
$LN15:
        vmovups   XMMWORD PTR [64+rsp], xmm13                   ;27.1
$LN16:
        vmovups   XMMWORD PTR [96+rsp], xmm12                   ;27.1
$LN17:
        vmovups   XMMWORD PTR [80+rsp], xmm6                    ;27.1
$LN18:
        call      __libm_sse2_sincos                            ;39.9
$LN19:
                                ; LOE rbx rbp rsi rdi r12 r13 r15 r14d xmm0 xmm1 xmm7 xmm8 xmm9 xmm10 xmm11 xmm15
.B1.12::                        ; Preds .B1.1
$LN20:

;;; 	cosY = cos(DEG2RAD(rotation_object[1]));

        vxorpd    xmm2, xmm2, xmm2                              ;40.9
$LN21:
        vmovapd   xmm14, xmm0                                   ;39.9
$LN22:
        vcvtss2sd xmm2, xmm2, DWORD PTR [rotation_object+4]     ;40.9
$LN23:
        vcvtsd2ss xmm13, xmm1, xmm1                             ;39.2
$LN24:
        vmulsd    xmm0, xmm15, xmm2                             ;40.9
$LN25:
        vmovss    DWORD PTR [cosX.5146.0.1], xmm13              ;39.2
$LN26:
        call      __libm_sse2_sincos                            ;40.9
$LN27:
                                ; LOE rbx rbp rsi rdi r12 r13 r15 r14d xmm0 xmm1 xmm7 xmm8 xmm9 xmm10 xmm11 xmm13 xmm14 xmm15
.B1.11::                        ; Preds .B1.12
$LN28:

;;; 	cosZ = cos(DEG2RAD(rotation_object[2]));

        vxorpd    xmm2, xmm2, xmm2                              ;41.9
$LN29:
        vmovapd   xmm6, xmm0                                    ;40.9
$LN30:
        vcvtss2sd xmm2, xmm2, DWORD PTR [rotation_object+8]     ;41.9
$LN31:
        vcvtsd2ss xmm12, xmm1, xmm1                             ;40.2
$LN32:
        vmulsd    xmm0, xmm15, xmm2                             ;41.9
$LN33:
        vmovss    DWORD PTR [cosY.5146.0.1], xmm12              ;40.2
$LN34:
        call      __libm_sse2_sincos                            ;41.9

Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 1: Basic Architecture

14.1.2 Instruction Syntax Enhancements

Intel AVX employs an instruction encoding scheme using a new prefix (known as “VEX” prefix). Instruction
encoding using the VEX prefix can directly encode a register operand within the VEX prefix. This support two new
instruction syntax in Intel 64 architecture:

• A non-destructive operand (in a three-operand instruction syntax): The non-destructive source reduces the
number of registers, register-register copies and explicit load operations required in typical SSE loops, reduces
code size, and improves micro-fusion opportunities.

• A third source operand (in a four-operand instruction syntax) via the upper 4 bits in an 8-bit immediate field.
Support for the third source operand is defined for selected instructions (e.g. VBLENDVPD, VBLENDVPS,
PBLENDVB).

Two-operand instruction syntax previously expressed in legacy SSE instruction as

ADDPS xmm1, xmm2/m128

128-bit AVX equivalent can be expressed in three-operand syntax as

VADDPS xmm1, xmm2, xmm3/m128

In four-operand syntax, the extra register operand is encoded in the immediate byte.
Note SIMD instructions supporting three-operand syntax but processing only 128-bits of data are considered part
of the 256-bit SIMD instruction set extensions of AVX, because bits 255:128 of the destination register are zeroed
by the processor.

I think it's impossible to fully auto-vectorize this, except maybe with intrinsic functions...

Anonymous · ‎09-24-2014

I have tested with 8,000,000 vertices: it runs at under 15 fps, which is good news because the vectorization of the data is not complete yet.

Bernard · ‎09-24-2014

>>>call __libm_sse2_sincos >>>

Force the compiler to inline the function calls.

Bernard · ‎09-25-2014

>>>"restrict" pointer is like static array no ?>>>

Please read description here: http://stackoverflow.com/questions/2005473/rules-for-using-the-restrict-keyword-in-c

Bernard · ‎09-25-2014

>>>I think it's impossible to full auto vectorise data,>>>

I think that it mainly depends on the compiler's analysis of the code. At the least, data accesses should fit the SSE or AVX register length, and there should be no interdependency between the vectorized data or between data loads and stores. Moreover, the compiler will try to estimate the theoretical speedup from vectorization and to assess whether the vectorized code will produce the same final result as the serial code.
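
The two conditions above (no overlap between loads and stores, no dependency between iterations) can be sketched in C as follows. The function names scale_add and prefix_sum are hypothetical examples; whether a given compiler vectorizes the first loop still depends on its own analysis and flags.

```c
#include <stddef.h>

/* Independent iterations: dst[i] depends only on a[i] and b[i].
 * `restrict` promises the compiler that the arrays do not overlap,
 * which removes one obstacle to auto-vectorization. */
void scale_add(float *restrict dst, const float *restrict a,
               const float *restrict b, float k, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        dst[i] = a[i] * k + b[i];
    }
}

/* Loop-carried dependency: each element needs the previous result,
 * so the iterations cannot simply run side by side in SIMD lanes. */
void prefix_sum(float *data, size_t n)
{
    for (size_t i = 1; i < n; ++i) {
        data[i] += data[i - 1];
    }
}
```

The restrict qualifier is a promise made by the programmer: passing overlapping arrays to scale_add would be undefined behavior.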

Bernard · ‎09-25-2014

>>> vmovups XMMWORD PTR [48+rsp], xmm14 >>>

Can you check the content of the xmm14 register with GDB? You should look for four single-precision floating-point values.

Bernard · ‎09-25-2014

typedef struct a
{
    float x;
    float y;
    float z;

    float cosX;
    float cosY;
    float cosZ;
    float sinX;
    float sinY;
    float sinZ;
    float x_end;
    float y_end;
    float z_end;
} a;

I think that you cannot force the compiler to vectorize code when your struct members are single variables. You need to operate on float array members.
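
A minimal sketch of the array-member layout (structure of arrays) is shown below. The names PointsSoA, MAX_POINTS and rotate_z are hypothetical, and the single Z-axis rotation stands in for the full rotation matrix used in the thread.

```c
#include <stddef.h>

#define MAX_POINTS 1024  /* hypothetical capacity */

/* Structure of arrays: all x values are contiguous, then all y, then all z. */
typedef struct {
    float x[MAX_POINTS];
    float y[MAX_POINTS];
    float z[MAX_POINTS];
} PointsSoA;

/* Rotate the first n points about the Z axis, given cos and sin of the angle. */
void rotate_z(PointsSoA *p, size_t n, float cos_z, float sin_z)
{
    for (size_t i = 0; i < n; ++i) {
        float x = p->x[i];
        float y = p->y[i];
        p->x[i] = x * cos_z - y * sin_z;
        p->y[i] = x * sin_z + y * cos_z;
        /* z is unchanged by a rotation about the Z axis */
    }
}
```

With this layout, consecutive iterations read consecutive floats from each array, which is the access pattern packed loads such as movups expect.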

Anonymous · ‎09-25-2014

As for vmovups XMMWORD PTR [48+rsp], xmm14: icl does that only to save SIMD registers, not for vector calculation. Try to find mulps/haddps, etc., and you won't see any vector calculation.

I will try inline assembler.

Anonymous · ‎09-25-2014

As for aligned/unaligned memory, there is a big surprise. Look at this code:

        vmovaps   xmm6, xmm1                                    ;195.28
$LN1464:
        vmovups   XMMWORD PTR [1152+rsp], xmm7                  ;195.28

It means the Intel compiler uses the aligned move to transfer to a register, but the unaligned move to transfer to a variable in memory.

In summary, we can't do vmovaps XMMWORD PTR [1152+rsp], xmm7 :/

But I don't understand: aligned values in memory are for vector calculation, yes?

So why does it work when I do unaligned memory transfers and vector calculation on them:

;====================================================================================================
;FUNCTIONS		FUNCTIONS		  FUNCTIONS	    	FUNCTIONS	    	FUNCTIONS	
;====================================================================================================			
; make_rotations:

		;=============
		; yaw
		;=============
			Yaw:	; y
				; Apply the rotation to the point	|[esi + 0] = x
				;									|[esi + 4] = y
				;									|[esi + 8] = z
				; Compute x = x.cos(phi.y) * cos(phi.z) - y.cos(phi.y) * sin(phi.z) - z.sin(phi.y)
				;
				; Compute A = x.cos(phi.y), B = y.cos(phi.y) and C = z.sin(phi.y)
					movups	xmm0, [_xmm2 + 4]
					movups	xmm1, [coordonee]
					mulps	xmm0, xmm1

				; Compute D = A * cos(phi.z), E = B * sin(phi.z) and C = C * 1
					movups	xmm1, [_xmm1 + 8]
					mulps	xmm0, xmm1

				; Compute F = D - E, C = C - 0
					hsubps	xmm0, xmm0

				; Compute xmm0 = F - C
					hsubps	xmm0, xmm0

				; Scale x by the ratio between x and y so that x stays proportional to y
					movd	xmm1, [rapport]
					divps	xmm0, xmm1

				; Save the new coordinate
					movd	[_x], xmm0

		;=============
		; / yaw
		;=============	
	
		;=============
		; pitch
		;=============
			Pitch:	; x
				; Apply the rotation to the point	|[esi + 0] = x
				;									|[esi + 4] = y
				;									|[esi + 8] = z
				; Compute y = x.(cos(phi.x) * sin(phi.z) - sin(phi.x) * cos(phi.z) * sin(phi.y)) + 
				;				 y.(sin(phi.x) * sin(phi.z) * sin(phi.y) + cos(phi.x) * cos(phi.z)) - 
				;				 z.(sin(phi.x) * cos(phi.y))
				;
				; Compute A = cos(phi.x) * sin(phi.z), B = sin(phi.x) * cos(phi.z), E = cos(phi.x) * cos(phi.z) and F = sin(phi.x) * sin(phi.z)
					movddup xmm0, [_xmm0 + 8]
					movups 	xmm1, [_xmm1]
					mulps	xmm0, xmm1

				; Save xmm0 in xmm7 so that Roll can reuse it: the equation for y resembles the equation for z,
				; except that sin(phi.y) is multiplied by different terms

				; Compute C' = A' * sin(phi.y) and G' = E' * sin(phi.y)
					movddup	xmm7, [_xmm2 + 12]
					mulps	xmm7, xmm0

				; Compute C = B * sin(phi.y) and G = F * sin(phi.y)
					movddup	xmm2, [_xmm2 + 16]
					mulps	xmm0, xmm2

				; movhlps copies the high half (bits 64..127) of a packed single-precision value (4 x 32 bits) into the low half (bits 0..63).
				; In short, split the two parts x and y:
				;	before:	xmm0 =	A) cos(phi.x) * sin(phi.z)
				;					C) sin(phi.x) * cos(phi.z) * sin(phi.y)
				;					E) cos(phi.x) * cos(phi.z)
				;					G) sin(phi.x) * sin(phi.z) * sin(phi.y)
				;	after:	xmm0 =	A) cos(phi.x) * sin(phi.z)
				;					C) sin(phi.x) * sin(phi.y) * cos(phi.z)
				;			xmm1 =	E) cos(phi.x) * cos(phi.z)
				;					G) sin(phi.x) * sin(phi.y) * sin(phi.z)
					movhlps xmm1, xmm0

				; Compute D = A - C
					hsubps xmm0, xmm0

				; Compute H = E + G
					haddps xmm1, xmm1

				; Compute sin(phi.x) * cos(phi.y) and cos(phi.x) * cos(phi.y)
				;
				; Compute I.roll = cos(phi.x) * cos(phi.y) and I.Pitch = sin(phi.x) * cos(phi.y) 
					movlps		xmm3, [_xmm0 + 8]
					movlps		xmm2, [_xmm2 + 4]
					mulps		xmm2, xmm3
					movshdup 	xmm3, xmm2
				; Compute x.D + y.H - z.I
				;
				; Compute J = x.D, K = y.H and L = z.I
					movups		xmm5, [coordonee]
					movsldup	xmm4, xmm1	; y.H
					movss		xmm4, xmm0	; x.D
					movlhps 	xmm4, xmm3	; z.I.Pitch
					mulps		xmm4, xmm5

				; Compute M = J + K
					haddps	xmm4, xmm4

				; Compute N = M - L
					hsubps	xmm4, xmm4

				; Save the new coordinate
					movd	[_y], xmm4
					
		;=============
		; / pitch
		;=============
		;=============
		; roll
		;=============
			Roll:	; z	
				; Apply the rotation to the point	|[esi + 0] = x
				;									|[esi + 4] = y
				;									|[esi + 8] = z
				; Compute z' = x.(cos(phi.x) * cos(phi.z) * sin(phi.y) + sin(phi.x) * sin(phi.z)) + 
				;				  y.(sin(phi.x) * cos(phi.z) - cos(phi.x) * sin(phi.z) * sin(phi.y)) +
				;				  z.(cos(phi.x) * cos(phi.y))
				;
				; movhlps copies the high half (bits 64..127) of a packed single-precision value (4 x 32 bits) into the low half (bits 0..63).
				; In short, split the two parts x and y:
				;	before:	xmm7 =	C') cos(phi.x) * sin(phi.z) * sin(phi.y)
				;					B') sin(phi.x) * cos(phi.z)
				;					G') cos(phi.x) * cos(phi.z) * sin(phi.y)
				;					F') sin(phi.x) * sin(phi.z)
				;	after:	xmm7 =	C') cos(phi.x) * sin(phi.z) * sin(phi.y)
				;					B') sin(phi.x) * cos(phi.z)
				;			xmm1 =	G') cos(phi.x) * cos(phi.z) * sin(phi.y)
				;					F') sin(phi.x) * sin(phi.z)
					movhlps xmm1, xmm7

				; Compute D' = -B' + C'
					movd	xmm6, [conv_signe]
					orps	xmm7, xmm6
					haddps	xmm7, xmm7

				; Compute H' = G' + F'
					haddps	xmm1, xmm1

				; Compute x.D' + y.H' + z.I'
				;
				; Compute J = x.D', K = y.H' and L = z.I'
					movups		xmm3, [coordonee]
					movsldup	xmm4, xmm7	; y.D'
					movss		xmm4, xmm1	; x.H'
					movlhps 	xmm4, xmm2	; z.I'
					mulps		xmm4, xmm3

				; Compute M' = J' + K'
					haddps	xmm4, xmm4

				; Compute N' = M' + L'
					haddps	xmm4, xmm4
		;=============
		; / roll
		;=============
; ret			
;====================================================================================================
;END_FUNCTIONS		END_FUNCTIONS		  END_FUNCTIONS	    	END_FUNCTIONS	    	END_FUNCTIONS	
;====================================================================================================
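
On the aligned versus unaligned question: movups accepts any address, while movaps raises a fault when its memory operand is not 16-byte aligned, so the unaligned form works wherever the data happens to sit. As a small sketch, alignment of an address can be checked in C as follows; the names vec_data and is_aligned_16 are hypothetical.

```c
#include <stdint.h>

/* A buffer that the compiler is asked to place on a 16-byte boundary (C11). */
_Alignas(16) float vec_data[8];

/* Return 1 if p is a multiple of 16 bytes, 0 otherwise. */
int is_aligned_16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}
```

Here &vec_data[0] and &vec_data[4] are 16-byte aligned, while &vec_data[1] is only 4-byte aligned and would need an unaligned move.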

Anonymous · ‎09-25-2014

I still don't understand data structure alignment. I read this on Wikipedia:

Data structure alignment is the way data is arranged and accessed in computer memory. It consists of two separate but related issues: data alignment and data structure padding. When a modern computer reads from or writes to a memory address, it will do this in word sized chunks (e.g. 4 byte chunks on a 32-bit system) or larger.

Why this system? [0x0000_0000] points to the first byte of memory and [0x0000_0001] points to the second byte.

mov eax, [0x0000_0000] ; load 4 bytes into eax, starting at the first byte of RAM

mov eax, [0x0000_0001] ; load 4 bytes into eax, starting at the second byte of RAM

Can we disable this memory management and access data at any address, byte by byte?

Anonymous · ‎09-25-2014

For clear_screen I use the same algorithm as in my kernel.asm:

void        clear_screen(void)
{
	int     loop = 0;
	while (loop < LENGTH*WIDTH)
	{
		WindowsD[loop++] = 0;
	}
}

	;=============
	 ; void clear_screen (void)
	 ; Clear screen
	 ; Input: None
	 ; Output: Screen
	 ; Destroyed: edi
	;=============	 
	clear_screen:
		mov		edi, [PhysBasePtr]					
		mov		ecx, (WIDTH*LENGTH*4)/16
		; vxorps	ymm1, ymm1			; 256 bit instruction !		
		xorps	xmm0, xmm0
		clear_s:
			; vmovdqu	[edi], ymm1		; 256 bit instruction !		
			movdqu 	[edi], xmm0		
			add		edi, 16
		loop	clear_s
	ret
	;===============
	; / clear_screen
	;===============

But the first one is faster than the second :/

Here's the assembly code generated from the C version:

        lea       rcx, QWORD PTR [WindowsD]                     ;190.3
$LN709:
        xor       edx, edx                                      ;190.3
$LN710:
        mov       r8d, 8294400                                  ;190.3
$LN711:
        call      _intel_fast_memset                            ;190.3

(mov r8d, 8294400 is for the loop, as in while(r8d--) { ... }; 8294400 is the size of WindowsD, which comes from LENGTH*WIDTH.)

(xor edx, edx sets the value that will be written to clear the memory.)

As for call _intel_fast_memset, I don't have access to that code, but it probably does:

while(r8d--)
{
     [rcx] = edx;     // fill pixel location at rcx with 0x0000_0000
     rcx += 4;
}
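
A loop that stores the same value to every element is commonly compiled into a call to a memset-style routine, which appears to be what the listing shows. Assuming _intel_fast_memset follows the standard memset(destination, value, byte count) signature (an assumption, since its source is not shown here), r8d would hold a size in bytes rather than a pixel count. The sketch below uses the standard memset; the buffer name frame and the SCREEN_WIDTH and SCREEN_HEIGHT values are hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define SCREEN_WIDTH  640   /* hypothetical; the thread's WIDTH is not given */
#define SCREEN_HEIGHT 480   /* hypothetical; the thread's LENGTH is not given */

uint32_t frame[SCREEN_WIDTH * SCREEN_HEIGHT];

/* Clear every 32-bit pixel. memset takes (destination, byte value, size in
 * bytes), so the third argument is the byte count, not the pixel count. */
void clear_frame(void)
{
    memset(frame, 0, sizeof frame);
}
```

Library memset implementations are usually tuned per CPU, which is one plausible reason a hand-written store loop can end up slower.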

Anonymous · ‎09-25-2014

I am trying inline asm and get the error "Unknown opcode DD in asm instruction" with:

void __declspec(naked) make_rotation()
{
	// Naked functions must provide their own prolog...
	__asm{
				translation:		dd  0
                               ...
		  }
}

Bernard · ‎09-26-2014

shaynox s. wrote:

I still don't understand data structure alignment. I read this on Wikipedia:

Data structure alignment is the way data is arranged and accessed in computer memory. It consists of two separate but related issues: data alignment and data structure padding. When a modern computer reads from or writes to a memory address, it will do this in word sized chunks (e.g. 4 byte chunks on a 32-bit system) or larger.

Why this system? [0x0000_0000] points to the first byte of memory and [0x0000_0001] points to the second byte.

mov eax, [0x0000_0000] ; load 4 bytes into eax, starting at the first byte of RAM

mov eax, [0x0000_0001] ; load 4 bytes into eax, starting at the second byte of RAM

Can we disable this memory management and access data at any address, byte by byte?

Data is laid out in aligned chunks of a double word (4 bytes). Data padding, or structure padding, is used to align a structure on, for example, double-word boundaries. If you have a structure with three members, two of them uint and one uchar, you will need 3 bytes of padding. I do not think that you can disable memory management.
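
The padding example can be checked directly with sizeof and offsetof. This is a sketch assuming a typical ABI where uint32_t has 4-byte alignment; the struct name Sample and the helper sample_padding are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Two 4-byte members and one 1-byte member. */
struct Sample {
    uint32_t a;
    uint32_t b;
    uint8_t  c;   /* typically followed by 3 bytes of tail padding */
};

/* Number of padding bytes: total size minus the sum of the member sizes. */
size_t sample_padding(void)
{
    return sizeof(struct Sample)
         - (sizeof(uint32_t) + sizeof(uint32_t) + sizeof(uint8_t));
}
```

The tail padding keeps every element of an array of struct Sample on a 4-byte boundary.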

Anonymous · ‎09-26-2014

OK. As for data declarations, I don't use C declarations, because icl doesn't align the data; unfortunately it places them arbitrarily.

Bernard · ‎09-26-2014

Usually the address is incremented by 1, 2 or 4.

mov eax, dword ptr [esi+4]

In this case the value at memory address esi+4 will be loaded into eax. From the point of view of the CPU and the memory controller, there are only memory cells at a granularity of one byte. To address a uint array, the base address of the first dword must be loaded into a register, and that register must then be incremented by 4 for each element.
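
The same addressing can be sketched in C. The helper names below are hypothetical; memcpy is used for the byte-offset access because it is valid C at any alignment, and x86 itself permits such unaligned dword accesses.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte distance from the start of a uint32_t array to element i (4 * i). */
size_t element_byte_offset(const uint32_t *base, size_t i)
{
    return (size_t)((const unsigned char *)&base[i]
                  - (const unsigned char *)base);
}

/* Store a 32-bit value at an arbitrary byte offset, aligned or not. */
void store_u32_at(unsigned char *bytes, size_t offset, uint32_t value)
{
    memcpy(bytes + offset, &value, sizeof value);
}

/* Load a 32-bit value from an arbitrary byte offset, aligned or not. */
uint32_t load_u32_at(const unsigned char *bytes, size_t offset)
{
    uint32_t value;
    memcpy(&value, bytes + offset, sizeof value);
    return value;
}
```

Stepping through a uint32_t array advances the address by 4 bytes per element, while the byte-offset helpers show that an access can start at any byte address.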