I'm doing audio processing so of course there are lots of loops with floating point operations. I am attempting to use a combination of the Intel compiler and IPP to improve performance. I've managed to squeeze some performance out of the Intel compiler (as opposed to VC9) by following this document:
Is there a similar document regarding memory alignment WITH EXAMPLES?
1) I'm using ippsMalloc_ to allocate my buffers and ippsIIRInitAlloc_ to initialize the filters which operate on those buffers, but in many cases I'm actually seeing worse performance than with the equivalent inline C++. Obviously I'm doing something horribly wrong, or there is just more for me to do in terms of alignment, but where do I begin? Do I go after the structures or classes which contain these buffers? Their pointers?
2) Is my best possible result going to be some combination of Intel compiler directives and ipp calls?
3) Unfortunately I'm working with a large codebase... Is there some feature of the compiler, or of VTune for example, that will help me identify prime candidates for alignment?
4) Out of curiosity, what exactly are the internal calls prefixed with "owns"? For example, if I call ippsIIR_, I see that call plus w7_ownsippsIIR_ sucking up cycles. Is that just the dispatcher at work, or is it indicative of something I may be doing wrong in terms of linking, etc.?
Sorry for the basic questions, it's just that there is a vast amount of (seemingly unorganized) documentation, and I'm having trouble finding the docs that specifically address the issues I'm trying to resolve.
IPP memory allocation functions always return a 16-byte-aligned memory block.
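To make the alignment point concrete, here is a minimal sketch in plain C++ (no IPP calls) for checking where a buffer pointer actually lands. The helper names is_aligned and floats_until_aligned are hypothetical, not part of IPP. The sketch assumes that what matters to vectorized code is the address of the sample data itself, so a pointer offset into an aligned buffer (for example buf + 1) is no longer 16-byte aligned even though the allocation was.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if p sits on an `alignment`-byte boundary.
// `alignment` must be a power of two.
inline bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Hypothetical helper: number of leading floats that lie before the next
// `alignment`-byte boundary. Assumes p is at least float-aligned.
inline std::size_t floats_until_aligned(const float* p,
                                        std::size_t alignment = 16) {
    assert(alignment >= sizeof(float) &&
           (alignment & (alignment - 1)) == 0);
    const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
    const std::size_t misalign =
        static_cast<std::size_t>(addr & (alignment - 1));
    return misalign == 0 ? 0 : (alignment - misalign) / sizeof(float);
}
```

Calling is_aligned on the pointers you pass to the filter functions, rather than on the objects that hold those pointers, is one quick way to see whether alignment is being lost somewhere between allocation and processing.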
It would help if you could attach a simple test case that shows the performance issue. One point to consider is the amount of data processed per call: for example, if you call an IPP function to process a single element, you most probably will not see any improvement over plain C code.
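The point about the amount of data per call can be sketched without IPP. The code below uses a hypothetical one-pole low-pass filter (OnePole, run_per_sample and run_per_block are illustration names, not IPP functions) and feeds it the same signal one sample per call and one block per call. Both variants compute the same output; the block variant simply makes far fewer calls, which is the usage pattern an optimized library routine is presumably designed for.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical first-order IIR low-pass: y[n] = a*x[n] + (1 - a)*y[n-1].
struct OnePole {
    float a;
    float state;
    explicit OnePole(float coeff) : a(coeff), state(0.0f) {}

    // Filters n samples in one call; the state carries over between calls.
    void process(const float* in, float* out, std::size_t n) {
        float y = state;
        for (std::size_t i = 0; i < n; ++i) {
            y = a * in[i] + (1.0f - a) * y;
            out[i] = y;
        }
        state = y;
    }
};

// One call per sample: the call overhead is paid x.size() times.
std::vector<float> run_per_sample(const std::vector<float>& x, float a) {
    OnePole f(a);
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        f.process(&x[i], &y[i], 1);
    }
    return y;
}

// One call per block: the call overhead is paid once per `block` samples.
std::vector<float> run_per_block(const std::vector<float>& x, float a,
                                 std::size_t block) {
    assert(block > 0);
    OnePole f(a);
    std::vector<float> y(x.size());
    for (std::size_t pos = 0; pos < x.size(); pos += block) {
        const std::size_t n = std::min(block, x.size() - pos);
        f.process(x.data() + pos, y.data() + pos, n);
    }
    return y;
}
```

A test case built along these lines, with the buffer length and call count stated, makes it much easier to tell whether a slowdown comes from the library call itself or from how often it is called.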
Functions with "strange" prefixes are internal IPP functions, and most of the run time is expected to be spent in them. The dispatcher works only at the initialization stage; after that, a call to a particular IPP function has no run-time dispatching overhead (apart from one additional jump instruction to the particular optimized code branch).
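As a rough illustration of dispatching that costs only one extra jump per call, here is a sketch of the general technique in plain C++. It is not IPP's actual implementation, and every name in it (scale, scale_generic, scale_unrolled, init_dispatch) is hypothetical. A function pointer is chosen once at initialization, and the public entry point then just jumps through it, so a profiler attributes the cycles to the selected internal function rather than to the entry point.

```cpp
#include <cassert>
#include <cstddef>

typedef void (*ScaleFn)(float*, std::size_t, float);

// Generic fallback branch.
static void scale_generic(float* buf, std::size_t n, float gain) {
    for (std::size_t i = 0; i < n; ++i) buf[i] *= gain;
}

// Stand-in for a CPU-specific branch (here simply unrolled by four).
static void scale_unrolled(float* buf, std::size_t n, float gain) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        buf[i] *= gain;
        buf[i + 1] *= gain;
        buf[i + 2] *= gain;
        buf[i + 3] *= gain;
    }
    for (; i < n; ++i) buf[i] *= gain;  // remaining tail samples
}

static ScaleFn g_scale = nullptr;

// Runs once at start-up; the flag stands in for a real CPU feature check.
void init_dispatch(bool cpu_has_fast_path) {
    g_scale = cpu_has_fast_path ? scale_unrolled : scale_generic;
}

// Public entry point: one indirect jump, then all the work happens in
// the branch selected by init_dispatch.
void scale(float* buf, std::size_t n, float gain) {
    assert(g_scale != nullptr);  // init_dispatch must be called first
    g_scale(buf, n, gain);
}
```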
There are a lot of samples provided in the IPP sample package.