IPP memory allocation functions always return a 16-byte aligned memory block.
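If you want to verify the alignment of a buffer yourself, a small check like the one below may help. This is only a sketch: `is_aligned` and `alloc_aligned16` are hypothetical helper names, and the allocator uses the C11 `aligned_alloc` as a stand-in rather than an IPP call, so that the example builds without the IPP headers. In your own code you would pass the pointer returned by the IPP allocation function to `is_aligned` instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: returns 1 if ptr is aligned to `alignment` bytes.
   `alignment` must be a non-zero power of two. */
static int is_aligned(const void *ptr, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)ptr & (alignment - 1)) == 0;
}

/* Hypothetical stand-in for an aligned allocator (NOT an IPP function).
   C11 aligned_alloc expects the size to be a multiple of the alignment,
   so the requested size is rounded up to the next multiple of 16.
   Release the block with free(). */
static void *alloc_aligned16(size_t bytes)
{
    size_t rounded = (bytes + 15u) & ~(size_t)15u;
    if (rounded == 0) {
        rounded = 16;
    }
    return aligned_alloc(16, rounded);
}
```

Typical use: allocate a buffer, call `is_aligned(buf, 16)` once right after allocation, and report or assert if it returns 0.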
It would help if you could attach a simple test case that shows the performance issue. One point to consider is the amount of data processed: for example, if you call an IPP function to process a single element, you will most probably not see any performance improvement over plain C code.
Functions with "strange" prefixes are internal IPP functions. Most of the time is expected to be spent in these internal functions. The dispatcher works only at the initialization stage; after that, a call to any particular IPP function has no run-time dispatching overhead (except for one additional jump instruction to the particular optimized code branch).
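To illustrate the idea of dispatching once at initialization, here is a minimal sketch in plain C. It is not the actual IPP dispatcher: the names `lib_init`, `lib_add`, `cpu_has_fast_path`, `add_generic` and `add_unrolled` are all made up for this example, and the "optimized" branch is just an unrolled loop standing in for CPU-specific code. The point is only that the branch is selected once, and every later call costs a single indirect jump through a pointer.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*add_fn)(const float *a, const float *b, float *dst, size_t len);

/* Plain reference branch. */
static void add_generic(const float *a, const float *b, float *dst, size_t len)
{
    for (size_t i = 0; i < len; ++i) {
        dst[i] = a[i] + b[i];
    }
}

/* Stand-in for a CPU-specific branch: four elements per iteration plus a tail. */
static void add_unrolled(const float *a, const float *b, float *dst, size_t len)
{
    size_t i = 0;
    for (; i + 4 <= len; i += 4) {
        dst[i]     = a[i]     + b[i];
        dst[i + 1] = a[i + 1] + b[i + 1];
        dst[i + 2] = a[i + 2] + b[i + 2];
        dst[i + 3] = a[i + 3] + b[i + 3];
    }
    for (; i < len; ++i) {
        dst[i] = a[i] + b[i];
    }
}

/* Hypothetical feature query; a real library would inspect CPUID here. */
static int cpu_has_fast_path(void)
{
    return 1;
}

/* Selected once by lib_init(), then reused by every call. */
static add_fn g_add = NULL;

/* Initialization stage: pick the code branch a single time. */
void lib_init(void)
{
    g_add = cpu_has_fast_path() ? add_unrolled : add_generic;
}

/* Public entry point: no feature checks, only one indirect jump. */
void lib_add(const float *a, const float *b, float *dst, size_t len)
{
    assert(g_add != NULL); /* lib_init() must have been called first */
    g_add(a, b, dst, len);
}
```

In a profiler, the time would then show up under the selected internal branch (here `add_unrolled`) rather than under the public entry point, which is similar to seeing the internally prefixed functions in an IPP profile.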
Many samples are provided in the IPP sample package.