If the transforms are large, then with 8 transforms per node you may be running short on RAM, because each descriptor may require some memory as well. What are your transform sizes, and how much RAM do you have per node?
[fortran]
!$OMP DO
do i=1,196
  ..some code...
  FFTForward
  ... some code.
  FFTBackward
end do
!$OMP END DO
[/fortran]
I did not encounter such problems with OpenMPI.
To localize the issue, I suggest that you comment out your own code in the loop and leave only the calls to MKL.
That would be a first step toward a solution: it would show whether the slowness is caused by MKL or by your code.
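For illustration, here is a minimal sketch of what such an MKL-only test could look like. It is not your code: the transform length n, the 1D double-complex in-place layout, and the one-descriptor-per-thread arrangement are all assumptions on my part, so adjust them to match your case. Only the loop count of 196 is taken from your snippet. It needs the MKL_DFTI module (built from mkl_dfti.f90 in the MKL include directory) and OpenMP enabled at compile time.

[fortran]
! Sketch only: n and the program layout are assumed, not taken from the original code.
program time_mkl_only
  use MKL_DFTI
  use omp_lib
  implicit none
  integer, parameter :: n = 1048576      ! assumed transform length
  integer, parameter :: nloops = 196     ! loop count from the post
  type(DFTI_DESCRIPTOR), pointer :: desc
  complex(8), allocatable :: x(:)
  integer :: i, status
  double precision :: t0, t1

  t0 = omp_get_wtime()
  !$OMP PARALLEL PRIVATE(desc, x, status)
  ! Each thread gets its own buffer and its own descriptor (assumption).
  allocate(x(n))
  x = (1.0d0, 0.0d0)
  status = DftiCreateDescriptor(desc, DFTI_DOUBLE, DFTI_COMPLEX, 1, n)
  ! Scale the backward transform by 1/n so repeated round trips stay bounded.
  status = DftiSetValue(desc, DFTI_BACKWARD_SCALE, 1.0d0/n)
  status = DftiCommitDescriptor(desc)
  if (status /= 0) print *, 'DFTI setup failed, status = ', status
  !$OMP DO
  do i = 1, nloops
     ! Only the MKL calls remain; all other loop work is commented out.
     status = DftiComputeForward(desc, x)
     status = DftiComputeBackward(desc, x)
  end do
  !$OMP END DO
  status = DftiFreeDescriptor(desc)
  deallocate(x)
  !$OMP END PARALLEL
  t1 = omp_get_wtime()

  ! The measured time includes descriptor setup and teardown.
  print *, 'MKL-only loop time (s): ', t1 - t0
end program time_mkl_only
[/fortran]

Running this with OMP_NUM_THREADS set to 1 and then to 8 should show how the MKL calls alone scale. If they scale well here but not in your full program, the remaining code in the loop is the more likely suspect.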