Intel® Integrated Performance Primitives
Deliberate problems developing high-performance vision, signal, security, and storage applications.

Performance evaluation of ippsAdd_32f and ippsSub_32f vs. a simple 2-for-loop implementation with /O3 optimization

SergeyKostrov
Valued Contributor II
537 Views

I've completed a performance evaluation of a linear algebra algorithm that uses the IPP functions ippsAdd_32f and ippsSub_32f against a simple 2-for-loop implementation of the same functionality in the same algorithm, compiled with /O3 ( Intel C++ compiler ) and /O2 ( Microsoft C++ compiler ) optimizations. The results are very interesting.
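For readers who want to picture the non-IPP side of the comparison, here is a minimal sketch of what plain-loop replacements for the two calls might look like. The function names add_32f_loop and sub_32f_loop are hypothetical, and the actual code used in the test was not posted, so treat this only as an illustration of the idea. The operand order in the subtraction follows the documented IPP convention ( second source minus first source ).

```c
#include <assert.h>

/* Hypothetical plain-loop stand-in for ippsAdd_32f:
   dst[i] = src1[i] + src2[i] for i in [0, len). */
static void add_32f_loop(const float *src1, const float *src2,
                         float *dst, int len)
{
    assert(len >= 0);
    for (int i = 0; i < len; ++i)
        dst[i] = src1[i] + src2[i];
}

/* Hypothetical plain-loop stand-in for ippsSub_32f.
   IPP documents the result as src2[i] - src1[i], so the
   same operand order is kept here. */
static void sub_32f_loop(const float *src1, const float *src2,
                         float *dst, int len)
{
    assert(len >= 0);
    for (int i = 0; i < len; ++i)
        dst[i] = src2[i] - src1[i];
}
```

Loops this simple are usually easy for an optimizing compiler to vectorize at /O3 or /O2, which is one plausible reason the two variants end up so close.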

In short: using the IPP functions gave only a ~0.30% performance improvement, which I consider negligible. Test results follow in later posts.

Thanks, and please ask questions if interested.

 

4 Replies
SergeyKostrov
A question to IDZ Administrators / Moderators: How can one edit the original ( 1st ) post of a newly created thread? I remember that editing was available in the past.
SergeyKostrov
[ Test results when IPP library is Not Used ]
...
Calculating...
Add - Completed in 3.554 ms
Add - Completed in 3.395 ms
Add - Completed in 3.525 ms
Sub - Completed in 3.367 ms
Sub - Completed in 3.127 ms
Add - Completed in 3.126 ms
Sub - Completed in 3.364 ms
Add - Completed in 3.506 ms
Sub - Completed in 3.491 ms
Add - Completed in 3.441 ms
Add - Completed in 3.103 ms
Sub - Completed in 2.968 ms
Add - Completed in 3.294 ms
Add - Completed in 3.094 ms
Add - Completed in 3.114 ms
Sub - Completed in 2.777 ms
Add - Completed in 2.756 ms
Add - Completed in 3.009 ms
( Algorithm ) - Pass 1 - Completed: 75.89500 secs
Add - Completed in 3.541 ms
Add - Completed in 3.556 ms
Add - Completed in 3.526 ms
Sub - Completed in 3.384 ms
Sub - Completed in 3.143 ms
Add - Completed in 3.148 ms
Sub - Completed in 3.363 ms
Add - Completed in 3.419 ms
Sub - Completed in 3.484 ms
Add - Completed in 3.423 ms
Add - Completed in 3.124 ms
Sub - Completed in 3.084 ms
Add - Completed in 2.904 ms
Add - Completed in 3.202 ms
Add - Completed in 3.128 ms
Sub - Completed in 2.770 ms
Add - Completed in 2.779 ms
Add - Completed in 3.039 ms
( Algorithm ) - Pass 2 - Completed: 75.87800 secs
...
SergeyKostrov
[ Test results when IPP library is Used ]
...
Calculating...
Add - Completed in 3.518 ms
Add - Completed in 3.401 ms
Add - Completed in 3.364 ms
Sub - Completed in 3.280 ms
Sub - Completed in 2.754 ms
Add - Completed in 2.830 ms
Sub - Completed in 3.280 ms
Add - Completed in 3.311 ms
Sub - Completed in 3.305 ms
Add - Completed in 3.062 ms
Add - Completed in 2.954 ms
Sub - Completed in 2.595 ms
Add - Completed in 2.790 ms
Add - Completed in 3.178 ms
Add - Completed in 3.177 ms
Sub - Completed in 2.726 ms
Add - Completed in 2.724 ms
Add - Completed in 2.997 ms
( Algorithm ) - Pass 1 - Completed: 75.63000 secs
Add - Completed in 3.500 ms
Add - Completed in 3.381 ms
Add - Completed in 3.443 ms
Sub - Completed in 3.256 ms
Sub - Completed in 2.773 ms
Add - Completed in 2.839 ms
Sub - Completed in 3.296 ms
Add - Completed in 3.431 ms
Sub - Completed in 3.290 ms
Add - Completed in 3.062 ms
Add - Completed in 2.955 ms
Sub - Completed in 2.594 ms
Add - Completed in 2.844 ms
Add - Completed in 3.173 ms
Add - Completed in 3.181 ms
Sub - Completed in 2.742 ms
Add - Completed in 3.123 ms
Add - Completed in 2.938 ms
( Algorithm ) - Pass 2 - Completed: 75.61300 secs
...
SergeyKostrov
With reduced output details...

[ A larger Data set - Test 1 - Algorithm with IPP - 0.29% faster than Test 2 ]
...
Calculating...
Algorithm - Pass 1 - Completed: 114.35901 secs
Algorithm - Pass 2 - Completed: 114.10901 secs
Algorithm - Pass 3 - Completed: 114.07801 secs    Note: Best Time ( BT1 )
Algorithm - Pass 4 - Completed: 114.07901 secs
Algorithm - Pass 5 - Completed: 114.09301 secs
...

[ A larger Data set - Test 2 - Algorithm without IPP - 0.29% slower than Test 1 ]
...
Calculating...
Algorithm - Pass 1 - Completed: 114.76601 secs
Algorithm - Pass 2 - Completed: 114.40601 secs    Note: Best Time ( BT2 )
Algorithm - Pass 3 - Completed: 114.40601 secs
Algorithm - Pass 4 - Completed: 114.46901 secs
Algorithm - Pass 5 - Completed: 114.42201 secs
...

Hardware & Software details:

Dell Precision Mobile M4700
Intel Core i7-3840QM ( Ivy Bridge / 4 cores / 8 logical CPUs / ark.intel.com/compare/70846 )
32GB RAM
320GB HDD
NVIDIA Quadro K1000M ( 192 CUDA cores / 2GB memory )
Windows 7 Professional 64-bit
Size of L3 Cache = 8MB ( shared between all cores for data & instructions )
Size of L2 Cache = 1MB ( 256KB per core / shared for data & instructions )
Size of L1 Cache = 256KB ( 32KB per core for data & 32KB per core for instructions )
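
The 0.29% figure for the larger data set can be reproduced from the two best times, BT1 and BT2. The helper below is a small sketch of that arithmetic; the name percent_faster is hypothetical, and it assumes the gap is expressed relative to the slower ( non-IPP ) time.

```c
#include <assert.h>

/* Gap between two timings as a percentage of the slower one.
   Hypothetical helper, not part of the original test code. */
static double percent_faster(double faster_secs, double slower_secs)
{
    assert(slower_secs > 0.0);
    return (slower_secs - faster_secs) / slower_secs * 100.0;
}
```

Calling percent_faster(114.07801, 114.40601) gives roughly 0.29, that is, a gap of 0.328 seconds on a run of about 114 seconds.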