srinivasu · ‎03-04-2014

Hi,

We are developing and optimizing a codec on Intel architecture, applying assembly optimization to the most time-consuming functions/modules identified with VTune Amplifier.

I have some basic questions; please clarify:

1. How can I find the stalls present in assembly code, and how can I remove them? Is reordering instructions the only solution?
2. Is there any way to know which instructions are pipelined?
3. I am unsure whether intrinsics or hand-written assembly gives better performance. Intrinsics are of course the better choice if portability is required, but I am looking for the best performance.
4. Are the IPP libraries license-free?
5. What are the basic strategies/steps for writing and optimizing an assembly function? If you have any documentation on this from the IPP implementation, please share it.

Regards,

Srinivasu

Bernard · ‎03-04-2014

>>>I am unsure whether intrinsics or hand-written assembly gives better performance. Intrinsics are of course the better choice if portability is required, but I am looking for the best performance.>>>

If intrinsics are compiled into the same machine instructions as the inline assembly, there should not be any difference in execution speed.

Bernard · ‎03-04-2014

>>>How can I find the stalls present in assembly code, and how can I remove them? Is reordering instructions the only solution?>>>

I can think of one case where VTune can find stalls in a certain sequence of assembly instructions. It is called a flag merge stall, and it is related to the shl cl instruction.

http://software.intel.com/en-us/node/497362

Matthias_H_Intel · ‎03-13-2014

@stalls: there could also be stalls due to, e.g., branch mispredictions. ilyapolak mentioned VTune. With that you should be able to find various kinds of stalls.
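As one hedged illustration of removing a hard-to-predict branch: the sketch below clamps negative samples to zero with _mm_max_ps instead of testing each element with an if. The function name clamp_negatives_to_zero is hypothetical, and whether the scalar form really mispredicts depends on the data and on the code the compiler generates, so VTune should confirm the problem before anything is rewritten.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: replaces every negative element of x[0..n) with 0.
   The vector loop contains no branch that depends on the sample values,
   so the branch predictor has nothing data-dependent to mispredict. */
void clamp_negatives_to_zero(float *x, size_t n)
{
    const __m128 zero = _mm_setzero_ps();
    size_t i = 0;

    /* Main loop: 4 floats per iteration, unaligned loads and stores. */
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(x + i);
        _mm_storeu_ps(x + i, _mm_max_ps(v, zero));
    }

    /* Scalar tail for the remaining 0..3 elements. */
    for (; i < n; ++i) {
        x[i] = (x[i] < 0.0f) ? 0.0f : x[i];
    }
}
```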

@intrinsics vs asm: In general you can obviously get any possible performance using asm. Whether you will actually achieve maximum performance is a different question. There are some cases where you might get better performance with asm than with intrinsics (e.g. if you don't follow the standard function prologue). However, with asm you will most likely have to optimize again for each and every new hw architecture.

Intrinsics, on the other hand, can sometimes even outperform your asm code, because the compiler can, e.g., use its knowledge of the registers currently in use and possibly assign them more effectively than in the asm counterpart, where the compiler has no flexibility.

In most cases I personally would recommend using intrinsics rather than asm:

- gives compiler chance to optimize e.g. on register usage

- easier transition to next gen hw via slightly higher level of abstraction
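To make the register-usage point concrete, here is a minimal sketch of intrinsics code, assuming plain SSE and a hypothetical function name add_arrays_sse. The __m128 values are ordinary C variables; which xmm registers hold them is left to the compiler, and _mm_add_ps normally becomes a single addps (or vaddps when AVX code generation is enabled).

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical example: out[i] = a[i] + b[i] for i in [0, n).
   No register names appear anywhere; the compiler allocates them. */
void add_arrays_sse(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;

    /* Vector part: 4 single-precision elements per iteration. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }

    /* Scalar remainder when n is not a multiple of 4. */
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

The same source can be rebuilt with a newer architecture option without touching any register assignments, which is the "easier transition" argument above.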

Depending on what exactly you need to optimize there might be further alternatives:

- for signal / image processing: use optimized Intel IPP libraries

- C++ vector classes included in Intel compiler

- Cilk+ array notation which is supported in Intel's compiler

@IPP licenses: for standard IPP check http://software.intel.com/en-us/intel-ipp

@5: I don't quite get the question. For general optimization you may check the Intel 64 and IA-32 Architectures Software Developer's Manuals.

Bernard · ‎03-14-2014

>>>With that you should be able to find various kinds of stalls>>>

Yes of course I should have mentioned that.

srinivasu · ‎03-14-2014

Thanks for the valuable input. Which options in VTune Amplifier show this? I am unable to find them. Please explain in more detail how to look for stalls with VTune Amplifier.

Matthias_H_Intel · ‎03-14-2014

srinivasu wrote:

Thanks for the valuable input. Which options in VTune Amplifier show this? I am unable to find them. Please explain in more detail how to look for stalls with VTune Amplifier.

A good starting point might be the Intel® VTune™ Amplifier Tutorials that come with the documentation. There is, e.g., a C++ tutorial section on "Identifying Hardware Issues": "Identify the hardware-related issues in your application such as data sharing, cache misses, branch misprediction, and others."

For a real expert level deep-dive it comes down to really understanding the underlying architecture ("Front-End" / "Back-End"), where bottlenecks / stalls may happen. The number of execution units, the various buffer sizes, and so on are all important to understand. Then you can pick the relevant VTune events and analyze the behavior of your application.

But maybe let's shift back a gear. I know neither your SW nor your expertise, so at the risk of stating the obvious:

1. Check/improve your algorithm: if you're at O(n^2), low-level optimization is probably not worth much if you could do it in O(n log n)
1a. check your data types - e.g. do you need doubles or can you do with floats?
1b. check for vectorization potential, e.g. in loops -> might require algorithmic changes to make reasonable use of vectorization
1c. check for parallelization - Intel Advisor can help you find places where parallelization makes sense (under the assumption that there is a reasonable transition from serial to parallel - it might as well require a complete redesign of your SW)
2. Analyze where to optimize (maybe check what D. E. Knuth said about "premature optimization" ;-)) - VTune can help you find hotspots. Optimizing a cold spot which is hit once during execution obviously won't help
3. check whether your compiler can already optimize for you (e.g. "ipo" on the Intel compiler)
...
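Points 1a and 1b can be sketched with a loop the compiler has a fair chance of vectorizing on its own. This is only an assumed example (the name saxpy and its signature are illustrative): it uses float, so a 128-bit register holds 4 elements instead of the 2 it would hold with double, and the restrict qualifiers promise the compiler that x and y do not overlap.

```c
#include <stddef.h>

/* Illustrative kernel: y[i] = a * x[i] + y[i].
   - float rather than double: twice as many elements per SIMD register.
   - restrict: x and y are promised not to alias, which removes one
     common obstacle to auto-vectorization.
   - unit stride, no branches, no calls inside the loop body. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Whether the loop is actually vectorized depends on the compiler and its options, so it is worth checking the compiler's vectorization report or the generated assembly before reaching for intrinsics.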

Bernard · ‎03-14-2014

>>>Is there any way to know which instructions are pipelined?>>>

Do you want to know which assembly instructions are/will be pipelined?

Bernard · ‎03-14-2014

>>>For a real expert level deep-dive it comes down to really understanding the underlying architecture ("Front-End" / "Back-End"), where bottlenecks / stalls may happen>>>

Another example of a Back-End bound stall/bottleneck could be excessive use of the long-latency floating-point division instruction divss (latency 21~29 cycles). In such a case the simplest optimization could be to use the rcpss instruction (latency ~5 cycles), whose latency is ~4x smaller, provided an approximate reciprocal is accurate enough for your data.
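A minimal sketch of that substitution, assuming a hypothetical helper named approx_recip. rcpss only returns an approximation of 1/d (relative error up to about 1.5 * 2^-12, i.e. roughly 12 bits), so the sketch adds one Newton-Raphson step to recover most of single precision. The extra multiplies and subtract eat into the latency advantage quoted above, so the result should be measured rather than assumed.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: approximate 1/d without a divss.
   Step 1: rcpss gives x0 ~ 1/d with about 12 bits of precision.
   Step 2: one Newton-Raphson iteration, x1 = x0 * (2 - d * x0),
           roughly doubles the number of correct bits.
   Not a drop-in replacement for division: the result is not correctly
   rounded, and d == 0 or infinite d yields NaN after the refinement. */
float approx_recip(float d)
{
    const __m128 vd  = _mm_set_ss(d);
    const __m128 two = _mm_set_ss(2.0f);
    const __m128 x0  = _mm_rcp_ss(vd);
    const __m128 x1  = _mm_mul_ss(x0, _mm_sub_ss(two, _mm_mul_ss(vd, x0)));
    return _mm_cvtss_f32(x1);
}
```

A division a / d then becomes a * approx_recip(d). If 12 bits are enough for the signal in question, the Newton-Raphson step can be dropped.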

Matthias_H_Intel · ‎03-14-2014

iliyapolak wrote:

>>>Is there any way to know which instructions are pipelined?>>>

Do you want to know which assembly instructions are/will be pipelined?

Well, it might make sense to understand which instructions can be pipelined on each port and which instructions / conditions will flush the pipeline, e.g. to prevent oversubscription of a port. However, that's nothing generic; you need to check the hw details for the specific Intel hw.

Bernard · ‎03-14-2014

I think that at a deeper level of the floating-point stack there will also be pipelining at the uop level. For example, during the execution of a divss uop, probably some part of the multiplier and adder can be used to start processing the next instruction.

Just guessing.

srinivasu · ‎07-18-2014

Hi,

This is a further query to get more information. I am optimizing a signal processing application with SSE4.1 intrinsics for an Atom x64 processor. The application/function can be optimized with SIMD instructions; with these I process 4 elements (32 bits each) at a time instead of one element at a time. Ideally we should save 3/4 of the time, but I am saving barely 2/4 of it. I am new to optimizing for this architecture.

The program involves many operations. The way I am using intrinsics might be wrong. These are some basic rules I am following:

1. I am not restricting the number of variables to 16 or fewer (the number of xmm registers); instead I freely use more variables.

2. I am not checking whether the address is aligned when accessing memory, i.e. I always use the unaligned memory access instructions.

3. I am not doing loop unrolling; I don't know how the compiler behaves when a loop is unrolled (my application requires more than 16 bit register usage).

4. VTune Amplifier shows hotspots in the intrinsics code. Will the compiler not remove stalls in intrinsics code?

Please clarify where I am going wrong, advise on any further guidelines to follow when optimizing an application, and share any solid optimization guideline documents for Intel architectures.

srinivasu · ‎07-18-2014

Typo corrected:

3. I am not doing loop unrolling; I don't know how the compiler behaves when a loop is unrolled (my function requires more than 16 xmm registers).

TimP · ‎07-18-2014

If assembly programming removes your motivation for loop unrolling, you would often be better off with plain source code. Beginning with Intel Core i7 CPUs, unrolling by 4 has more often than not appeared to be optimal in my benchmarks (although, for AVX, this may depend on the improvements in vector remainders shown by the current beta compiler).

GNU compilers apply command-line-based unrolling to intrinsics code, so there is an opportunity to test it conveniently.

With both Intel and GNU compilers, you need the command line options to get effective levels of unrolling (but Intel compilers don't apply unrolling to intrinsics).

Due to shadow register renaming, unrolling is usually effective even though you have run out of named xmm registers. Some of the stalls which may be evident in VTune as instructions spending time waiting for operands to become available may be overcome by unrolling. "Riffling" so as to overcome latency of reductions is a technique which uses up more named xmm registers, as it is usually applied, so you may be able to improve on what the compilers do with parallel reductions, by taking account of which operations are eligible for renaming.
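A minimal sketch of unrolling by 4 with independent accumulators, which is what "riffling" a reduction appears to mean here. The function name sum_floats_4acc is hypothetical. Each addps depends only on its own accumulator, so four additions can be in flight at once instead of each one waiting for the previous result.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical example: sum of x[0..n) using 4 independent vector
   accumulators (16 floats per iteration) to hide the add latency.
   Note: this changes the order of the floating-point additions, so the
   result can differ slightly from a strictly sequential sum. */
float sum_floats_4acc(const float *x, size_t n)
{
    __m128 acc0 = _mm_setzero_ps();
    __m128 acc1 = _mm_setzero_ps();
    __m128 acc2 = _mm_setzero_ps();
    __m128 acc3 = _mm_setzero_ps();
    size_t i = 0;

    /* Unrolled main loop: four separate dependency chains. */
    for (; i + 16 <= n; i += 16) {
        acc0 = _mm_add_ps(acc0, _mm_loadu_ps(x + i));
        acc1 = _mm_add_ps(acc1, _mm_loadu_ps(x + i + 4));
        acc2 = _mm_add_ps(acc2, _mm_loadu_ps(x + i + 8));
        acc3 = _mm_add_ps(acc3, _mm_loadu_ps(x + i + 12));
    }

    /* Combine the four accumulators, then the four lanes. */
    __m128 acc = _mm_add_ps(_mm_add_ps(acc0, acc1), _mm_add_ps(acc2, acc3));
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    float sum = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);

    /* Scalar remainder for the last 0..15 elements. */
    for (; i < n; ++i) {
        sum += x[i];
    }
    return sum;
}
```

The four accumulators cost four named xmm registers, which is the trade-off mentioned above; with a single accumulator the loop would be limited by the latency of one chain of dependent adds.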

I just ran into cases today where the compilers choose to generate AVX-128 code when the architecture option is stepped up to AVX2, but even with intrinsics the compiler doesn't always use literally what you wrote. In some such cases, Intel may respond to bug report submissions. I hope you aren't counting on those bugs remaining so as to justify the work you put into assembly.

SSE4.1 unaligned instructions have seemed fairly effective since Core i7, but there is still an advantage in 32-byte data alignment. If assembly programming is inhibiting you from taking advantage of newer instruction sets, I would be concerned that it's counter-productive. Intel C++ takes some liberty to implement intrinsics according to the architecture set on the command line.
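A minimal sketch of taking the aligned path when the data allows it, assuming hypothetical names is_aligned_16 and scale_inplace. The 16-byte test matches what _mm_load_ps and _mm_store_ps require; the 32-byte alignment mentioned above would be a stricter choice made at allocation time, for example with _mm_malloc(bytes, 32).

```c
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Returns nonzero if p is aligned to a 16-byte boundary. */
static int is_aligned_16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical example: x[i] *= s for i in [0, n).
   Uses aligned loads/stores when the buffer starts on a 16-byte
   boundary and falls back to the unaligned forms otherwise. */
void scale_inplace(float *x, size_t n, float s)
{
    const __m128 vs = _mm_set1_ps(s);
    size_t i = 0;

    if (is_aligned_16(x)) {
        /* i advances by 4 floats = 16 bytes, so alignment is kept. */
        for (; i + 4 <= n; i += 4) {
            _mm_store_ps(x + i, _mm_mul_ps(_mm_load_ps(x + i), vs));
        }
    } else {
        for (; i + 4 <= n; i += 4) {
            _mm_storeu_ps(x + i, _mm_mul_ps(_mm_loadu_ps(x + i), vs));
        }
    }

    /* Scalar remainder for the last 0..3 elements. */
    for (; i < n; ++i) {
        x[i] *= s;
    }
}
```

On recent cores the two paths may perform almost identically when the data happens to be aligned, so the branch is mainly useful on older hardware; measuring both on the target Atom is the only reliable guide.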

You asked earlier about licensing of IPP. If you have a full license (not student, non-commercial, or beta) which entitles you to distribute binaries of your application, you are entitled to use static linking or the redistributable versions of the libraries which come with the compiler. You and your legal department should read the licenses.

You asked some questions which could only be answered fully by proprietary pipe tracing, which isn't generally feasible.

Optimization guidelines