To evaluate IPP as a solution for a particular problem, I've implemented an algorithm twice: once in straight C and once using the IPP signal processing functions. When I benchmark the two, the C version runs almost twice as fast as the IPP version. Because this is an eval version, all the IPP libraries are dynamic. I'm running on a dual 3 GHz Xeon under Red Hat 4 Linux.
Here is a code fragment of the primary algorithm:
#if defined(IPP)
// ^tmpbuff = ^x - ^mu for vec_dim elements
ippsSub_32f(mu, x, tmpbuff, vec_dim);
// ^tmpbuff = ^tmpbuff * ^tmpbuff for vec_dim elements
ippsMul_32f_I(tmpbuff, tmpbuff, vec_dim);
// ^tmpbuff = ^tmpbuff * ^ivar for vec_dim elements
ippsMul_32f_I(ivar, tmpbuff, vec_dim);
// sum = SUM(^tmpbuff) over vec_dim elements
ippsSum_32f(tmpbuff, vec_dim, &sum, ippAlgHintFast);
*s_ptr += sum;
mu += vec_dim;
ivar += vec_dim;
#else
for (j = vec_dim-1; j >= (BLOCK_train-1); j -= BLOCK_train)   /* groups of BLOCK_train */
{
    tmp = (*x++ - *mu++);  *s_ptr += tmp*tmp*(*ivar++);
    tmp = (*x++ - *mu++);  *s_ptr += tmp*tmp*(*ivar++);
    tmp = (*x++ - *mu++);  *s_ptr += tmp*tmp*(*ivar++);
    tmp = (*x++ - *mu++);  *s_ptr += tmp*tmp*(*ivar++);
    tmp = (*x++ - *mu++);  *s_ptr += tmp*tmp*(*ivar++);
    tmp = (*x++ - *mu++);  *s_ptr += tmp*tmp*(*ivar++);
    tmp = (*x++ - *mu++);  *s_ptr += tmp*tmp*(*ivar++);
}
#endif
All the buffers were allocated with ippMalloc so they have the correct alignment. Each row is 49 elements, but I've padded it to 56 elements for the IPP calls by appending zeros, on the assumption that a row needs to be a multiple of 32 bytes (8 floats).
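Roughly, the buffers are set up like the sketch below. This is a simplified illustration, not my exact code: n_rows and vec_dim_padded are placeholder names, and I've written it with ippsMalloc_32f (the typed equivalent of ippMalloc, also 32-byte aligned).

#include <ipps.h>

int vec_dim        = 49;     /* actual row length */
int vec_dim_padded = 56;     /* padded to a multiple of 8 floats (32 bytes) */
int n_rows         = 1024;   /* placeholder row count */

Ipp32f *mu      = ippsMalloc_32f(n_rows * vec_dim_padded);
Ipp32f *ivar    = ippsMalloc_32f(n_rows * vec_dim_padded);
Ipp32f *x       = ippsMalloc_32f(vec_dim_padded);
Ipp32f *tmpbuff = ippsMalloc_32f(vec_dim_padded);

ippsZero_32f(mu,   n_rows * vec_dim_padded);   /* padding stays zero */
ippsZero_32f(ivar, n_rows * vec_dim_padded);
/* ... fill the first vec_dim elements of each row ... */
/* later: ippsFree(mu); ippsFree(ivar); ippsFree(x); ippsFree(tmpbuff); */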
I used oprofile to try to determine where the time is being spent, and I found a substantial amount of time in functions named own_XXX_32f, where XXX is Sub, Sum, or Mul. In fact, far more time is spent in these than in the functions actually doing the work.
Running a fixed number of iterations over the C version takes about 3.2 seconds; the IPP version takes almost 5 seconds.
Any ideas as to what might be going on here?
1 Reply
It seems you want to compute the L2 norm of x-mu (or the square of the norm?).
You could try a more suitable IPP function such as
ippiNormDiff_L2_32f_C1R
or compute the sum directly as a dot product with ippsDotProd_32f.
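For example, here is a minimal sketch of the dot-product idea, reusing the variable names from your fragment (status checks omitted). Since the quantity you accumulate is SUM((x[i]-mu[i])^2 * ivar[i]), the second multiply and the sum can be folded into one ippsDotProd_32f call, so each row needs three IPP calls instead of four:

Ipp32f sum;
ippsSub_32f(mu, x, tmpbuff, vec_dim);            /* tmpbuff = x - mu           */
ippsSqr_32f_I(tmpbuff, vec_dim);                 /* tmpbuff = (x - mu)^2       */
ippsDotProd_32f(tmpbuff, ivar, vec_dim, &sum);   /* sum = SUM((x-mu)^2 * ivar) */
*s_ptr += sum;
mu   += vec_dim;
ivar += vec_dim;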
