Intel® ISA Extensions

About vbroadcastss and vperm2f128 & vshufps

peryli
Beginner
1,454 Views
Hi, everyone

In my program, the code looks like this:

for(...)
{
broadcast v6 , [ ptr ]
broadcast v7 , [ ptr+8 ]
...
broadcast v10, [ ptr+40 ]

vop( v0, v6 )
...
vop( v5, v10 )

broadcast v6 , [ ptr+1]
broadcast v7 , [ ptr+9 ]
...
broadcast v10, [ ptr+41 ]

vop( v0, v6 )
...
vop( v5, v10 )

......

broadcast v6 , [ ptr+5 ]
broadcast v7 , [ ptr+13 ]
...
broadcast v10, [ ptr+45 ]

vop( v0, v6 )
...
vop( v5, v10 )
}

My question is: if I use vperm2f128 and vshufps instead of vbroadcast, which is more efficient?

Like this:

for(... )
{
vmovaps v6 , [ ptr ]
...
vmovaps v10, [ ptr ]

vperm2f128 v11, v6, v6,imm
vshufps v11, v11, v11, imm
vop( v0, v11 )
...
vperm2f128 v11, v11, v11,imm
vshufps v11, v11, v11, imm
vop( v5, v11 )

vperm2f128 v11, v6, v6,imm2
vshufps v11, v11, v11, imm2
vop( v0, v11 )
...
vperm2f128 v11, v11, v11,imm2
vshufps v11, v11, v11, imm2
vop( v5, v11 )
...


vperm2f128 v11, v6, v6,imm5
vshufps v11, v11, v11, imm5
vop( v0, v11 )
...
vperm2f128 v11, v11, v11,imm5
vshufps v11, v11, v11, imm5
vop( v5, v11 )
}
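
For readers unsure how the vperm2f128 + vshufps pair replaces a broadcast, here is a minimal scalar sketch in C. It is only a model: `model_vperm2f128`, `model_vshufps` and `broadcast_element` are hypothetical helper names, and the registers are plain `float[8]` arrays rather than real ymm registers. It shows how the two immediates could be chosen so that element k of an already-loaded vector ends up in all 8 lanes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of VPERM2F128 dst, a, b, imm8 on 8-float "registers".
   For each 128-bit half of dst: bit 1 of the control nibble picks the
   source (a or b), bit 0 picks its low or high half, bit 3 zeroes it. */
static void model_vperm2f128(float dst[8], const float a[8],
                             const float b[8], uint8_t imm)
{
    float tmp[8];
    for (int half = 0; half < 2; ++half) {
        uint8_t ctl = (uint8_t)(imm >> (4 * half));
        const float *src = (ctl & 2) ? b : a;
        int src_half = ctl & 1;
        for (int i = 0; i < 4; ++i)
            tmp[4 * half + i] = (ctl & 8) ? 0.0f : src[4 * src_half + i];
    }
    memcpy(dst, tmp, sizeof tmp);
}

/* Scalar model of the 256-bit VSHUFPS dst, a, b, imm8. Each 128-bit lane
   is shuffled independently: two elements come from a, two from b. */
static void model_vshufps(float dst[8], const float a[8],
                          const float b[8], uint8_t imm)
{
    float tmp[8];
    for (int lane = 0; lane < 2; ++lane) {
        const float *la = a + 4 * lane;
        const float *lb = b + 4 * lane;
        tmp[4 * lane + 0] = la[imm & 3];
        tmp[4 * lane + 1] = la[(imm >> 2) & 3];
        tmp[4 * lane + 2] = lb[(imm >> 4) & 3];
        tmp[4 * lane + 3] = lb[(imm >> 6) & 3];
    }
    memcpy(dst, tmp, sizeof tmp);
}

/* Broadcast element k (0..7) of v into all 8 lanes of dst:
   step 1 copies the 128-bit half holding k into both halves,
   step 2 replicates element (k mod 4) inside each half. */
static void broadcast_element(float dst[8], const float v[8], int k)
{
    assert(k >= 0 && k < 8);
    uint8_t perm_imm = (k >= 4) ? 0x11 : 0x00;     /* high or low half twice */
    uint8_t shuf_imm = (uint8_t)((k & 3) * 0x55);  /* same 2-bit index, 4 times */
    model_vperm2f128(dst, v, v, perm_imm);
    model_vshufps(dst, dst, dst, shuf_imm);
}
```

Calling `broadcast_element(out, v, 5)` fills all eight entries of `out` with `v[5]`. The model says nothing about speed; it only illustrates why two shuffle-type instructions are needed per broadcast value in this scheme.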

Thanks!

5 Replies
Nicolae_P_Intel
Employee
I would suggest implementing both approaches and running them through the Intel Architecture Code Analyzer, which you can download from http://software.intel.com/en-us/articles/intel-architecture-code-analyzer/.
Brijender_B_Intel

As suggested by Nicolae, use the Architecture Code Analyzer. It is not just a question of using vbroadcast or vperm/shuffle; the choice also depends on the other instructions in the loop. Perm/shuffle will put more pressure on port 5. If your code is not port 5 limited, you may benefit from them.

Secondly, if you are broadcasting constants, I suggest defining them already in broadcast format, i.e. store
const0, const1, const2 as
const0, const0, const0, const0, const0, const0, const0, const0.
Then you avoid both the broadcast and the perm/shuffle; it will be only one load instruction.
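
A minimal sketch of that pre-broadcast layout in plain C, assuming 8 floats per 256-bit register. The helper name `expand_constants` and the `LANES` macro are hypothetical, not from any library. For an aligned load such as vmovaps, the table would additionally have to be 32-byte aligned, which this sketch leaves to the caller.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 8  /* floats per 256-bit AVX register (assumption of this sketch) */

/* Expand n scalar constants into a table in which each constant is
   repeated LANES times. `table` must have room for n * LANES floats.
   Row c of the table (table + c * LANES) can then be fetched with one
   full-width load instead of a broadcast or a perm/shuffle pair. */
static void expand_constants(float *table, const float *consts, size_t n)
{
    assert(table != NULL && consts != NULL);
    for (size_t c = 0; c < n; ++c) {
        for (size_t lane = 0; lane < LANES; ++lane) {
            table[c * LANES + lane] = consts[c];
        }
    }
}
```

The trade-off is memory: the table is 8 times larger than the scalar constants, so this is likely to pay off only when the set of constants is small enough to stay in cache.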
peryli
Beginner
First, thanks to Nicolae and Brijender!

I considered the port 5 pressure, but will the latency from port 5 contention be greater than the latency of loading the data from host memory? I have no information about this yet.

In my program, each loop iteration has 48 shuffle ops (with the replacement scheme) versus 36 broadcast ops (the original scheme, though effectively 18 because two load/store ports can be used). So the question is: will the latency caused by instruction port pressure be greater than the latency of the data loads * 3? (Apart from the first batch of loads, I expect the next five batches to get good cache hits.) I could not find information about the instruction latencies and the data transfer latency, and the AVX code analyzer cannot fully answer the question.
Brijender_B_Intel
Let us look at it in a different way:
1. Approach 1: 48 shuffles + "x" loads.
2. Approach 2: 36 broadcasts, which means 36 loads and 36 shuffles.

Assuming some constants for the load latency and the shuffle instructions, you can compute comparable values for the two approaches.
Another thing I noticed in your for loop (it may be a typo) is that you are loading V0 and V5 from one location, "ptr"; it may be different for each of V0 to V5. So you will have 5 extra loads in both cases, but the total number of loads in #1 is much lower than in #2. However, in both cases port 5 (which executes shuffles) is loaded in addition to the load ports. That is why the VOPs in the algorithm play a critical role. If they end up on port 5, you may be better off going with approach 2. Also, if the compiler can generate code that removes the dependency between loads and shuffles (I mean, interleaves the code with independent operations), it can hide this port pressure.
It is hard to judge the algorithm's performance from looking at only part of the instructions being executed. The algorithm's bottleneck may be entirely at the end, where the whole VOP tree is collapsed.
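
One rough way to put numbers on this comparison is a port-throughput lower bound, sketched below in C. Note this is a different and cruder estimate than a latency calculation: it assumes port 5 retires one shuffle-type uop per cycle and that the load ports together retire `load_ports` loads per cycle, and it ignores latencies, dependencies and cache misses entirely. All names (`uop_counts`, `shuffle_scheme`, `broadcast_scheme`, `min_cycles`) are hypothetical, and the per-broadcast cost of one load plus one shuffle is taken from the counts given above.

```c
#include <assert.h>

/* Uops per loop iteration that compete for port 5 and for the load ports. */
struct uop_counts {
    int port5_uops;
    int load_uops;
};

/* Approach 1: explicit loads plus perm/shuffle ops; VOPs that also
   issue on port 5 add to the same pressure. */
static struct uop_counts shuffle_scheme(int shuffles, int loads,
                                        int vops_on_port5)
{
    struct uop_counts c = { shuffles + vops_on_port5, loads };
    return c;
}

/* Approach 2: each broadcast is counted as one load and one shuffle. */
static struct uop_counts broadcast_scheme(int broadcasts, int vops_on_port5)
{
    struct uop_counts c = { broadcasts + vops_on_port5, broadcasts };
    return c;
}

/* Lower bound on cycles per iteration from port throughput alone:
   the busier of port 5 and the load ports sets the floor. */
static int min_cycles(struct uop_counts c, int load_ports)
{
    assert(load_ports > 0);
    int load_cycles = (c.load_uops + load_ports - 1) / load_ports;
    return c.port5_uops > load_cycles ? c.port5_uops : load_cycles;
}
```

For example, with the 48 shuffles and 36 broadcasts from this thread, two load ports, and an assumed x = 5 loads for approach 1 (an illustrative value only), `min_cycles(shuffle_scheme(48, 5, 0), 2)` gives 48 and `min_cycles(broadcast_scheme(36, 0), 2)` gives 36. Real behaviour can differ considerably, so measuring both variants remains the reliable test.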
peryli
Beginner
Thanks, Brijender.
"Another thing i noticed in your for loop (it may be typo) that you are loading V0 and V5 from one location "ptr". it may be different for each for V0--- V5"

An offset is added to 'ptr' at the end of each loop iteration :).
Your answers are important to me; I will test this and post the results.

In addition, when will Knights Corner be introduced? Do you have any news in this regard?