Hi, everyone,
In my program, the code looks like this:
for(...)
{
broadcast v6 , [ ptr ]
broadcast v7 , [ ptr+8 ]
...
broadcast v10, [ ptr+40 ]
vop( v0, v6 )
...
vop( v5, v10 )
broadcast v6 , [ ptr+1]
broadcast v7 , [ ptr+9 ]
...
broadcast v10, [ ptr+41 ]
vop( v0, v6 )
...
vop( v5, v10 )
......
broadcast v6 , [ ptr+5 ]
broadcast v7 , [ ptr+13 ]
...
broadcast v10, [ ptr+45 ]
vop( v0, v6 )
...
vop( v5, v10 )
}
My question is: if I use vperm2f128 and vshufps instead of vbroadcast, which is more efficient?
Like this:
for(... )
{
vmovaps v6 , [ ptr ]
...
vmovaps v10, [ ptr ]
vperm2f128 v11, v6, v6,imm
vshufps v11, v11, v11, imm
vop( v0, v11 )
...
vperm2f128 v11, v11, v11,imm
vshufps v11, v11, v11, imm
vop( v5, v11 )
vperm2f128 v11, v6, v6,imm2
vshufps v11, v11, v11, imm2
vop( v0, v11 )
...
vperm2f128 v11, v11, v11,imm2
vshufps v11, v11, v11, imm2
vop( v5, v11 )
...
 
vperm2f128 v11, v6, v6,imm5
vshufps v11, v11, v11, imm5
vop( v0, v11 )
...
vperm2f128 v11, v11, v11,imm5
vshufps v11, v11, v11, imm5
vop( v5, v11 )
}
Thanks!
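To make the second variant concrete, here is a minimal Python sketch that emulates the lane semantics of vperm2f128 and vshufps on 8-float vectors and checks that the pair reproduces a broadcast of element k. The helper names (`broadcast_immediates`, `broadcast_via_shuffles`) and the particular immediates are illustrative assumptions; the post does not say which `imm` values it uses.

```python
def vperm2f128(src1, src2, imm):
    """Emulate VPERM2F128 on two 8-float vectors (lists of length 8)."""
    halves = [src1[0:4], src1[4:8], src2[0:4], src2[4:8]]

    def pick(ctrl):
        # Bit 3 of each 4-bit control field zeroes that 128-bit half.
        if ctrl & 0x8:
            return [0.0] * 4
        # Bits 1:0 select one of the four source halves.
        return list(halves[ctrl & 0x3])

    return pick(imm & 0xF) + pick((imm >> 4) & 0xF)


def vshufps(src1, src2, imm):
    """Emulate 256-bit VSHUFPS: the same 4-element shuffle in each 128-bit lane."""
    out = []
    for base in (0, 4):
        out.append(src1[base + (imm & 3)])
        out.append(src1[base + ((imm >> 2) & 3)])
        out.append(src2[base + ((imm >> 4) & 3)])
        out.append(src2[base + ((imm >> 6) & 3)])
    return out


def broadcast_immediates(k):
    """Immediates (assumed, not from the post) that broadcast element k, 0 <= k < 8."""
    perm_imm = 0x11 if k >= 4 else 0x00  # copy the half holding element k to both halves
    shuf_imm = (k % 4) * 0x55            # same 2-bit index in all four fields
    return perm_imm, shuf_imm


def broadcast_via_shuffles(v, k):
    """Broadcast v[k] to all 8 lanes using vperm2f128 followed by vshufps."""
    perm_imm, shuf_imm = broadcast_immediates(k)
    t = vperm2f128(v, v, perm_imm)
    return vshufps(t, t, shuf_imm)
```

The emulation only checks that the two-instruction sequence is functionally equivalent to a broadcast; it says nothing about which variant runs faster on real hardware.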
					
				
			
			
				
			
			
			
			
			
			
			
		
		
		
	
	
	
		5 Replies
	
		
		
			
			
			
					
	
I would suggest implementing both approaches and running them through the Intel Architecture Code Analyzer, which you can download from http://software.intel.com/en-us/articles/intel-architecture-code-analyzer/.
					
				
			
			
				
			
			
			
			
			
			
			
		
		
		
	
	
	
As Nicolae suggested, use the Architecture Code Analyzer. It is not just a question of vbroadcast versus vperm/shuffle; the best choice also depends on the other instructions in the loop. Permutes and shuffles put more pressure on port 5. If your code is not port 5 limited, you may benefit from them.
Secondly, if you are broadcasting constants, I suggest storing them in memory already in broadcast form. That is, define
const0, const1, const2 as
const0, const0, const0, const0, const0, const0, const0, const0.
Then you avoid both the broadcasts and the permutes/shuffles; each constant needs only one load instruction.
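The pre-broadcast layout described above can be sketched as follows. This is only a model of the memory layout, assuming 32-bit floats and 8 lanes per 256-bit register; the function names `prebroadcast` and `load_vector` are hypothetical.

```python
from array import array

LANES = 8  # eight 32-bit floats per 256-bit AVX register (assumption)


def prebroadcast(constants, lanes=LANES):
    """Return a flat float32 table where each constant is repeated `lanes` times."""
    table = array("f")
    for c in constants:
        table.extend([c] * lanes)
    return table


def load_vector(table, index, lanes=LANES):
    """Model one full-width load of constant `index`; no broadcast or shuffle is needed."""
    start = index * lanes
    return list(table[start:start + lanes])
```

For example, `load_vector(prebroadcast([1.0, 2.0, 3.0]), 1)` yields eight copies of 2.0. The trade-off is that the table takes `lanes` times as much memory as the scalar constants, so it occupies more cache.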
					
First, thanks to Nicolae and Brijender!
I considered the port 5 pressure, but will the latency caused by port 5 be greater than the latency of loading the data from host memory? I have no information about this at the moment.
In my program, each loop iteration has 48 shuffle ops (with the alternative scheme) or 36 broadcast ops (with the original scheme, though effectively 18 because two load ports can be used). So the question now is: will the latency caused by instruction port pressure be greater than the latency of the data loads * 3 (apart from the first batch of loads, I expect the next five batches to get good cache hits)? I could not find figures for the instruction latencies or the data transfer latencies, and the AVX code analyzer cannot fully answer this question.
		
		
	
	
	
					
Let us look at it in a different way:
1. Approach 1: 48 shuffles + "x" loads.
2. Approach 2: 36 broadcasts, which amount to 36 loads and 36 shuffles.
Assuming some constants for the load and shuffle latencies, you can compute comparable values for the two approaches.
Another thing I noticed in your for loop (it may be a typo) is that you are loading V0 and V5 from one location, "ptr". It may be different for each of V0 through V5, so you will have 5 extra loads in both cases. But the total number of loads in #1 is much smaller than in #2. However, in both cases port 5 (which executes shuffles) is loaded in addition to the load ports. That is why the VOPs in the algorithm play a critical role. If they end up on port 5, you may be better off going with approach 2. Also, if the compiler can generate code that removes the dependency on the loads/shuffles (I mean, interleaves the code with independent operations), it can hide this port pressure.
It is hard to judge the algorithm's performance from only part of the instructions being executed. The bottleneck may lie entirely at the end, where the whole VOP tree is collapsed.
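The comparison outlined in points 1 and 2 can be sketched as a back-of-the-envelope port-occupancy model. This is a rough illustration under stated assumptions, not a measurement: it assumes one uop per cycle per port, all shuffles on port 5, loads spread evenly over two load ports, and it counts each broadcast as one load plus one shuffle as the reply does. The value passed for the "x" loads and the number of VOPs placed on port 5 are hypothetical inputs.

```python
import math


def port_bound_cycles(shuffles, loads, vops_on_port5=0, load_ports=2):
    """Lower bound on cycles per loop iteration from port occupancy alone.

    Ignores dependency chains, instruction latency, and cache misses.
    """
    port5 = shuffles + vops_on_port5           # everything competing for port 5
    load = math.ceil(loads / load_ports)       # loads split across the load ports
    return max(port5, load)


def compare(x_loads, vops_on_port5=0):
    """Compare the shuffle scheme (48 shuffles + x loads) with 36 broadcasts."""
    return {
        "shuffle": port_bound_cycles(48, x_loads, vops_on_port5),
        "broadcast": port_bound_cycles(36, 36, vops_on_port5),
    }
```

Under these assumptions both schemes are bounded by port 5 rather than by the load ports, so the model mainly shows how much the placement of the VOPs matters. Actual instruction costs differ between microarchitectures, so the numbers should be checked against a tool such as the code analyzer mentioned earlier or against measurements.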
					
				
			
			
				
			
			
			
			
			
			
			
		
		
		
	
	
	
					
Thanks, Brijender.
"Another thing I noticed in your for loop (it may be a typo) is that you are loading V0 and V5 from one location, "ptr". It may be different for each of V0 through V5"
An offset is added to 'ptr' at the end of each loop iteration :) .
Your answers are important to me. I will test this and post the results.
In addition, when will Knights Corner be introduced? Do you have any news on this?
 
					
				
				
			
		
					