Dear Intel Support,
I implemented a basic element-wise addition function using a manual for loop and AVX2 intrinsics. I compiled my code with optimization level -O3 and compared its performance with Intel IPP’s ippsAdd_32f function.
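For reference, here is a simplified sketch of the two versions I compared (timing code omitted; the function names are just for illustration):

```c
#include <immintrin.h>
#include <ipp.h>

/* My hand-written version: dst[i] = a[i] + b[i] with AVX2. */
static void add_avx2(const float *a, const float *b, float *dst, int len)
{
    int i = 0;
    for (; i + 8 <= len; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(dst + i, _mm256_add_ps(va, vb));
    }
    for (; i < len; i++)   /* scalar tail */
        dst[i] = a[i] + b[i];
}

/* The IPP call I benchmarked against. */
static void add_ipp(const Ipp32f *a, const Ipp32f *b, Ipp32f *dst, int len)
{
    ippsAdd_32f(a, b, dst, len);
}
```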
For small array sizes, both implementations show similar performance. However, for larger arrays, the IPP function performs significantly better. Initially, I thought this was due to better cache utilization and pipelining, but since I am already compiling with -O3, I wonder if there are additional techniques involved on the IPP side.
Could you please clarify whether IPP uses further optimizations, such as cache-aware tiling, software prefetching, or multi-threading with Intel TBB or other libraries?
Thank you in advance.
Best regards,
Hi,
If your hand-written AVX2 addition code is well written, it can already fully utilize the SIMD capabilities of the CPU, so IPP has limited room for further optimization at small data sizes. In addition, for small inputs the overhead of calling an IPP function (such as parameter checking and dispatching logic) may offset the advantages of its optimizations.
IPP uses advanced optimization techniques such as loop unrolling, software pipelining, and cache blocking. These optimizations pay off when the data size is large, because they need a sufficient number of iterations to amortize their overhead. Internally, IPP does not use TBB or other threading libraries, but as a user you can still benefit from applying tiling and threading around IPP calls, as sketched below.
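As a rough illustration of what unrolling buys (this is a hand-written sketch, not IPP's actual source): processing several vectors per iteration exposes more independent instructions, which helps the CPU's pipelining and hides load latency.

```c
#include <immintrin.h>

/* Illustrative 4x-unrolled AVX2 add. The four adds per iteration are
 * independent, so the CPU can overlap their loads and executions. */
static void add_avx2_unrolled(const float *a, const float *b,
                              float *dst, int len)
{
    int i = 0;
    for (; i + 32 <= len; i += 32) {   /* 4 vectors x 8 floats */
        __m256 v0 = _mm256_add_ps(_mm256_loadu_ps(a + i),
                                  _mm256_loadu_ps(b + i));
        __m256 v1 = _mm256_add_ps(_mm256_loadu_ps(a + i + 8),
                                  _mm256_loadu_ps(b + i + 8));
        __m256 v2 = _mm256_add_ps(_mm256_loadu_ps(a + i + 16),
                                  _mm256_loadu_ps(b + i + 16));
        __m256 v3 = _mm256_add_ps(_mm256_loadu_ps(a + i + 24),
                                  _mm256_loadu_ps(b + i + 24));
        _mm256_storeu_ps(dst + i,      v0);
        _mm256_storeu_ps(dst + i + 8,  v1);
        _mm256_storeu_ps(dst + i + 16, v2);
        _mm256_storeu_ps(dst + i + 24, v3);
    }
    for (; i < len; i++)               /* scalar tail */
        dst[i] = a[i] + b[i];
}
```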
Here is one article for your reference: Tiling and Threading.
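And here is a minimal sketch of the tiling-plus-threading pattern that article describes, applied around ippsAdd_32f. The OpenMP usage and tile size are assumptions you would tune for your own system, not something IPP does internally:

```c
#include <ipp.h>
#include <omp.h>

/* Split the arrays into cache-friendly tiles and let each thread run
 * the single-threaded IPP primitive on its own tile. */
static void add_tiled(const Ipp32f *a, const Ipp32f *b, Ipp32f *dst, int len)
{
    const int tile = 16 * 1024;   /* assumed tile size; tune for your L2 */
    #pragma omp parallel for schedule(static)
    for (int off = 0; off < len; off += tile) {
        int n = (len - off < tile) ? (len - off) : tile;
        ippsAdd_32f(a + off, b + off, dst + off, n);
    }
}
```

Note that a simple element-wise add is memory-bandwidth bound at large sizes, so tiling pays off most when you chain several IPP operations over the same tile while it is still hot in cache, as the article explains.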
Regards,
Ruqiu
