Is there any way to speed up accumulating the absolute value of a 3D dot product?
INT16 *S,*D
or
INT32 *S,*D
int Sum = 0;
for (int i = 0; i < N; i++) Sum += abs((INT32)S[i*3]*D[i*3] + (INT32)S[i*3+1]*D[i*3+1] + (INT32)S[i*3+2]*D[i*3+2]);
N is in the range [100, 200].
This is part of the TBB task already, so no need to parallelize this accumulation.
Any ideas if IPP is any help here? Or SIMD intrinsics? Or anything else?
Thank you in advance!
Hi,
There is no function in IPP that adds every three elements of an array, so that part has to be written as a plain C loop. The rest of the computation can use IPP functions, as below:
Ipp32s *tmp1, *tmp2; /* tmp1 holds 3*N products, tmp2 holds N sums */
ippsMul_16s32s_Sfs(S, D, tmp1, 3*N, 0);
for (i = 0; i < N; i++)
    tmp2[i] = tmp1[3*i] + tmp1[3*i+1] + tmp1[3*i+2];
ippsAbs_16s_I(tmp2, N);
ippsSum_32s_Sfs(tmp2, N, &Sum, 0);
Thanks,
Chao
Thank you very much for your reply. I will test performance of your code, compare it to my naive variant and post results.
-Artem
Hi Chao,
Your code is performing correctly (small typo: ippsAbs_32s, not 16s, is needed).
However, it is 1.75 times slower in my scenario than the plain C code (i.e. 245 sec vs. 140 sec).
I am using the VC++ 9 compiler, though; I will try the Intel compiler soon.
Thanks again,
Artem
Hi Artem,
I believe this is something that can be accomplished using intrinsics:
1st option - given that 3 is quite an unusual number in our field of work, I can only recommend that you insert a 4th component that is always zero. This should help processing even though some of the calculations are unused. But memory consumption will grow by 1/3rd.
2nd option - store components in a planar manner instead of interleaved.
3rd option - don't touch how your data is organized. This will require a little more thinking for an effective SIMD intrinsics implementation. The tricky part is the 3 additions; maybe you can find a color-to-gray conversion sample to inspire an implementation that suits your needs.
regards,
Matthieu
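For illustration, the 1st option could look roughly like the SSE2 sketch below. It assumes the INT16 case, with each point stored as {x, y, z, 0}, and the function names are made up for this example. `_mm_madd_epi16` multiplies eight 16-bit pairs and adds neighbouring products, so two padded points fit in one 128-bit register. Like the original loop, it assumes each dot product and the running sum fit in 32 bits. Treat it as a starting point to benchmark, not a tuned implementation.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>
#include <cstdlib>

// Scalar reference for the padded layout: point i is {x, y, z, 0} at index 4*i.
int sumAbsDot3PaddedScalar(const int16_t* s, const int16_t* d, int n) {
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        int dot = (int)s[4*i]   * d[4*i]
                + (int)s[4*i+1] * d[4*i+1]
                + (int)s[4*i+2] * d[4*i+2];
        sum += std::abs(dot);
    }
    return sum;
}

// SSE2 version (hypothetical name): handles two padded points per iteration.
int sumAbsDot3PaddedSse2(const int16_t* s, const int16_t* d, int n) {
    __m128i acc = _mm_setzero_si128();
    int i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128i vs = _mm_loadu_si128(reinterpret_cast<const __m128i*>(s + 4*i));
        __m128i vd = _mm_loadu_si128(reinterpret_cast<const __m128i*>(d + 4*i));
        // 32-bit lanes: {x0*x0' + y0*y0', z0*z0' + 0, x1*x1' + y1*y1', z1*z1' + 0}
        __m128i m = _mm_madd_epi16(vs, vd);
        // Swap neighbouring 32-bit lanes and add: each dot product now appears twice.
        __m128i dot = _mm_add_epi32(m, _mm_shuffle_epi32(m, _MM_SHUFFLE(2, 3, 0, 1)));
        // SSE2 has no 32-bit abs instruction, so use (x ^ sign) - sign.
        __m128i sign = _mm_srai_epi32(dot, 31);
        __m128i absDot = _mm_sub_epi32(_mm_xor_si128(dot, sign), sign);
        acc = _mm_add_epi32(acc, absDot);
    }
    // Lane 0 holds the sum over even points, lane 2 the sum over odd points.
    int sum = _mm_cvtsi128_si32(acc) + _mm_cvtsi128_si32(_mm_srli_si128(acc, 8));
    // Scalar tail for an odd point count.
    for (; i < n; ++i) {
        int dot = (int)s[4*i]   * d[4*i]
                + (int)s[4*i+1] * d[4*i+1]
                + (int)s[4*i+2] * d[4*i+2];
        sum += std::abs(dot);
    }
    return sum;
}
```

Only SSE2 is used here because it is available on every x86-64 CPU; on SSSE3 and later, `_mm_abs_epi32` and `_mm_hadd_epi32` could replace the manual abs and shuffle steps.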
If it took 245 sec, I assume your data buffer is really big. Another optimization opportunity is to execute all functions on a buffer that fits into the L2 cache,
e.g.
[cpp]
// step size depends on the cache size and the number of read/write buffers
for (int i = 0; i < 60000000; i += 1000000)
{
    Ipp32s* p = pSource + i;
    func1(p);
    func2(p);
    func3(p);
    ...
}
[/cpp]
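To make the blocking idea concrete for this thread's computation, here is a minimal portable sketch without IPP. The three inner loops stand in for func1/func2/func3 (multiply, sum each group of three, abs and accumulate), and the function name and `blockPoints` parameter are assumptions made for this example. Every stage runs on one block before moving on to the next, so the temporary buffers stay small enough to remain in cache.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hypothetical helper: sums |dot3(S[i], D[i])| over `count` interleaved points,
// running all stages on blocks of at most `blockPoints` points.
long long sumAbsDot3Blocked(const int16_t* s, const int16_t* d,
                            std::size_t count, std::size_t blockPoints) {
    if (blockPoints == 0) blockPoints = 1;      // avoid an endless loop
    std::vector<int32_t> prod(3 * blockPoints); // per-block products
    std::vector<int32_t> dot(blockPoints);      // per-block dot products
    long long sum = 0;
    for (std::size_t start = 0; start < count; start += blockPoints) {
        const std::size_t n = std::min(blockPoints, count - start);
        const int16_t* ps = s + 3 * start;
        const int16_t* pd = d + 3 * start;
        // Stage 1: element-wise products (the role of ippsMul_16s32s_Sfs).
        for (std::size_t j = 0; j < 3 * n; ++j)
            prod[j] = static_cast<int32_t>(ps[j]) * pd[j];
        // Stage 2: add each group of three products.
        for (std::size_t j = 0; j < n; ++j)
            dot[j] = prod[3*j] + prod[3*j+1] + prod[3*j+2];
        // Stage 3: absolute value and accumulation.
        for (std::size_t j = 0; j < n; ++j)
            sum += std::abs(dot[j]);
    }
    return sum;
}
```

The result does not depend on `blockPoints`; only the memory access pattern changes, so the block size can be tuned by measurement.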
You are right, this function works on medical images; the data is a 3D matrix of, say, 256x256x256 with 16MB total.
This function is called for a single line, with a length somewhere in [100, 256], by 4 threads under TBB, with 100% thread utilization.
I cannot predict what kind of machine it will be executed on, as the customer base is diverse.
Hi,
You can't predict what kind of machine it will be executed on, but you can make your code adapt. IPP provides ippGetMaxCacheSizeB, which returns the size of the L2 or L3 cache. This lets you choose how to divide your data so that the cache is used efficiently.
Regards,
Matthieu
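As a rough sketch of how the cache size could be turned into a block length: the helper below is hypothetical (its name and the "use half of the cache" budget are assumptions, not anything IPP prescribes). The cache size is passed in as a parameter so the example compiles without IPP. In real code it would come from `ippGetMaxCacheSizeB`, which, as far as I know, takes an `int*` and writes the size in bytes.

```cpp
#include <cstddef>

// Hypothetical helper: number of elements per block such that `numBuffers`
// buffers of `elemBytes`-sized elements together use about half of the cache.
// With IPP, cacheBytes would be obtained roughly like this (assumed usage):
//   int cacheBytes = 0;
//   ippGetMaxCacheSizeB(&cacheBytes);
std::size_t blockLengthForCache(std::size_t cacheBytes,
                                std::size_t numBuffers,
                                std::size_t elemBytes) {
    if (numBuffers == 0 || elemBytes == 0) return 0; // nothing to size
    const std::size_t budget = cacheBytes / 2;       // assumption: leave half for other data
    const std::size_t len = budget / (numBuffers * elemBytes);
    return len > 0 ? len : 1;                        // always make progress
}
```

For example, with a 2 MB cache and four Ipp32s buffers this gives 65536 elements per block. With several threads sharing one cache, the budget would probably need to be divided further.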