topic Re: Fast 3D DCT for N~1M in Intel® oneAPI Math Kernel Library

Fast 3D DCT for N~1M

nikunj1729 — Sun, 25 Nov 2007 09:07:03 GMT

I am writing a 3D acoustics simulator for which I need to repeatedly compute a DCT (more precisely, DCT-II) on a 3D array of size in the range 128x128x128. I have been exploring all the options I have. Earlier, the only good option I had was FFTW, which unfortunately, doesn't SSE optimize its DCT functions. But now, MKL 10.0 has DCT support which is really great. But as I went through the docs, I didn't see any functions for doing 3D DCT transforms in the Trignometric toolset. All I saw were functions for doing 1D transformations. Of course, since internally the FFT interface is being used, and MKL's FFT works for 3D, it is possible to do a 3D transform too.

So, two questions:
1. Does MKL have an API for doing fast 3D DCTs?
2. A (real->real) DCT should ideally be about 4x faster than a complex->complex FFT on the same number of elements. Does MKL exhibit this behavior? How efficiently does it reduce the DCT to an FFT.

thanks a lot!!
-Nikunj.
--

Re: Fast 3D DCT for N~1M

mambru37 — Wed, 27 Feb 2008 21:20:14 GMT

I was looking for the same, but I wasn't able to find anything on the manual. Since the poisson solver works for 3d and it uses the DCT, it should be trivial to implement a working 3D DCT (provided the source code, which is not available :-)). Until then, I'll keep on using FFTW (which supports SSE btw).

Re: Fast 3D DCT for N~1M

nikunj1729 — Wed, 27 Feb 2008 21:29:17 GMT

Really, does FFTW SSE-optimize DCT as well? I know the older versions (when I had posted the message) didn't have SSE optimizations for DCT and they had mentioned this in the docs. There was some talk of putting it in for the R2R transforms, but I don't know if its actually been done.

thanks for the reply!
-Nikunj.
--

Re: Fast 3D DCT for N~1M

amyron — Thu, 28 May 2009 04:04:34 GMT

Hi,
Is there any updates related to this thread question?
Is the 3D DCT available in MKL? If yes, please give an example.

Thanks
Andriy

Re: Fast 3D DCT for N~1M

Alexander_K_Intel2 — Thu, 28 May 2009 04:51:36 GMT

Quoting - amyron

Hi,
Is there any updates related to this thread question?
Is the 3D DCT available in MKL? If yes, please give an example.

Thanks
Andriy

Hi Andriy,

3D DCT is not in MKL exactly but there is a variant to create it by combination of 1D DCT, for example:

tt_type = MKL_COSINE_TRANSFORM;
d_init_trig_transform(&n_x,&tt_type_x,ipar_x,dpar_x,&ir);
d_init_trig_transform(&n_y,&tt_type_y,ipar_y,dpar_y,&ir);
d_init_trig_transform(&n_z,&tt_type_z,ipar_z,dpar_z,&ir);
d_commit_trig_transform(f,&handle_x,ipar_x,dpar_x,&ir);
d_commit_trig_transform(f,&handle_y,ipar_y,dpar_y,&ir);
d_commit_trig_transform(f,&handle_z,ipar_z,dpar_z,&ir);

Loop over j,k {d_backward_trig_transform(f(:,j,k),&handle_x,ipar_x,dpar_x,&ir);}
Loop over i,k {d_backward_trig_transform(f(i,:,k),&handle_y,ipar_y,dpar_y,&ir);}
Loop over i,j {d_backward_trig_transform(f(i,j,:),&handle_z,ipar_z,dpar_z,&ir);}

free_trig_transform(&handle_x,ipar_x,&ir);
free_trig_transform(&handle_y,ipar_y,&ir);
free_trig_transform(&handle_z,ipar_z,&ir);

But if you want to see 3D DCT functionality in MKL you can file feature request at https://premier.intel.com
With best regards,
Alexander

Re: Fast 3D DCT for N~1M

zgsun — Wed, 02 Jun 2010 23:52:30 GMT

Hi, Alexander, is there openmp version of such trig_transform, or how can I use such trig_transform
associated with openmp? I have 3D variable and 2D of that variable need to be trig_transformed. So I want to
use such subroutines with openmp.

Could you please give me an example? I appreciate that. I tried many forms but I only succeed in without openmp case.

thanks!

Re: Fast 3D DCT for N~1M

Alexander_K_Intel2 — Thu, 03 Jun 2010 01:54:03 GMT

Hi!

There is not 2D version of TT with OpenMP parallelization but one can construct it from 1D TT himself. For example:

tt_type = MKL_COSINE_TRANSFORM;

d_init_trig_transform(&n_x,&tt_type_x,ipar_x,dpar_x,&ir);

d_init_trig_transform(&n_y,&tt_type_y,ipar_y,dpar_y,&ir);

ipar_x[9] = number_of_threads;

ipar_y[9] = number_of_threads;

d_commit_trig_transform(f,&handle_x,ipar_x,dpar_x,&ir);
d_commit_trig_transform(f,&handle_y,ipar_y,dpar_y,&ir);

#OMP parallel for

Loop over j {d_backward_trig_transform(f(:,j,k),&handle_x,ipar_x,dpar_x,&ir);}

#OMP end parallel

#OMP parallel for
Loop over i {d_backward_trig_transform(f(i,:,k),&handle_y,ipar_y,dpar_y,&ir);}

#OMP end parallel

free_trig_transform(&handle_x,ipar_x,&ir);
free_trig_transform(&handle_y,ipar_y,&ir);

Ipar[9] Specifies the number of OpenMP threads to run TT routines in the OpenMP

environment of the Poisson Library.

Is this variant suitable for you or not?

With best regards,

Alexander Kalinkin

Re: Fast 3D DCT for N~1M

zgsun — Thu, 03 Jun 2010 07:07:07 GMT

thank you so much!

I really make it work with 8 threads, but, only with variable f(33,33,10) , for f of other size, the results
are incorrect, strange. Could you please help me with some hints?

At the same time, I notice that the overhead problem for trig_transform is serious---with 8 threads, it
is even slower than using single threads! Surpringly! Maybe I make something wrong here?

thanks, zhigang

Re: Fast 3D DCT for N~1M

Gennady_F_Intel — Thu, 03 Jun 2010 07:21:27 GMT

Hi everybody,

are you interesingin C examples only or in F90 API also?

--Gennady

Re: Fast 3D DCT for N~1M

Alexander_K_Intel2 — Thu, 03 Jun 2010 07:24:13 GMT

zhigang,

it's a really strange situation. Is your test work correct on 1 threads? And what processors do you use?

With best regards,

Alexander Kalinkin

Re: Fast 3D DCT for N~1M

zgsun — Thu, 03 Jun 2010 15:27:59 GMT

Hi, Alexander, the results with 1 threads are just correct.
The machine model I am using is: Intel Xeon CPU E5430 @ 2.66GHz

In fact, I was delighted by the performace of trig_transform at first. Then I want to see what is its performance
with openMP. I appreciate that if you can help me with this. I tried in many ways but failed.

thanks!

Re: Fast 3D DCT for N~1M

Alexander_K_Intel2 — Wed, 09 Jun 2010 00:38:33 GMT

Hizhigang,

I've checked testcase prepared myself and it's work correctly. Could you attach here your example? And it would be really fine if this example notextremalbig!

With best regards,

Alexander Kalinkin