Vectorization (and parallelization?) in a single OpenMP thread

Youcef_K_ · ‎04-08-2013

Hi everyone,

I'm a new MKL user and I would like to know whether BLAS routines and VML functions are vectorized.

I'm currently using OpenMP to apply a complex filter to an image on a quad-core machine (hyperthreading disabled). I divide the image into 4 equal parts and process each part on its own core. Inside this process, I have some loops that operate on vectors. I have replaced some of these loops (where possible) with BLAS or VML functions, but without any gain in time. I expected vectorization to make the code faster than the loops. Am I wrong? Maybe the vectors need to be larger than a certain size?

Another question: I don't expect parallelization inside BLAS or VML to be effective in my case, because they are called from a single OpenMP thread. Am I wrong?

I would be very grateful if someone could help.

Youcef

TimP · ‎04-08-2013

Level 1 BLAS functions, as well as VML, can't be expected to vectorize better than a normal vectorizing compiler, and will require long loops (size > 4000 ?) to be competitive. At the time MKL was introduced, gcc and MSVC were not auto-vectorizing, so there could be significant advantage in the library call.

You would likely wish to examine the vectorization report of your compiler to see whether the compiler itself is taking advantage of opportunities for vectorization.

If by "vectorized" you mean by low level methods such as intrinsics or asm, compiler auto-vectorization should be equally effective for most of these operations. There may be a few which current MSVC would not vectorize, as well as some for which gcc would require aggressive options such as -O3 -ffast-math to engage vectorization.
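As a point of comparison, here is a minimal sketch of the kind of plain C loop that a vectorizing compiler can usually handle on its own. The name `axpy_loop` is hypothetical and not part of MKL. The `restrict` qualifiers are an assumption about the caller's data (the arrays must not overlap); they often remove the aliasing doubt that blocks auto-vectorization.

```c
#include <assert.h>
#include <stddef.h>

/* y[i] = a * x[i] + y[i] for i in [0, n).
   Hypothetical helper for illustration, not an MKL routine.
   `restrict` promises the compiler that x and y do not overlap. */
void axpy_loop(size_t n, float a, const float *restrict x, float *restrict y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Whether this loop is actually vectorized depends on the compiler and its options, so the vectorization report remains the thing to check.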

SergeyKostrov · ‎04-08-2013

I see that you didn't take cache-line limitations into account; check the datasheet for your CPU on ark.intel.com.

>>...I expected vectorization to make the code faster than loops do. Am i wrong? Maybe the vectors should be larger than a certain size?..

My question is: how big is the source image?

Youcef_K_ · ‎04-08-2013

@TimP

Thanks for your answer. I'm going to compile my code without replacing the loops with VML or MKL calls, using the Intel compiler, and see the effect of auto-vectorization.

@Sergey

In my process, the size of the image is not relevant, because I divide it into 8x8 patches and, for each patch, I gather its 16 nearest patches into a vector. So I deal with vectors of size 16x(8x8) = 1024. Maybe that is too small according to TimP's condition (> 4000).

SergeyKostrov · ‎04-08-2013

>>...So i deal with vector of size 16x(8x8) = 1024. Maybe vector size is too small according to TimP condition (>4000)...

You could try to increase it. As I mentioned, it is possible that you're dealing with cache-related issues. For example, prefetching (with the _mm_prefetch intrinsic function) is very effective when the size of the data set to be processed is greater than some threshold, and that threshold depends on the CPU and its cache-line size.

TimP · ‎04-08-2013

_mm_prefetch will fetch data across page boundaries, while hardware prefetch does not. You may be able to get the same effect by setting compiler option -opt-prefetch. If your loops are vectorized and of length about 1000, this is not likely to help.

Andrey_N_Intel · ‎04-08-2013

Hi Youcef,

To get an idea of the performance of VML functions, have a look at the data available at http://software.intel.com/sites/products/documentation/doclib/mkl_sa/11/vml/functions/_performanceall.html. You can pick the functions you are interested in and see how their performance depends on vector size, accuracy, and threading. For example, this link shows the performance of the exp() function: http://software.intel.com/sites/products/documentation/doclib/mkl_sa/11/vml/functions/exp.html

The Intel MKL manual suggests using VML functions when the vector size is at least 40, which I believe covers your case. Otherwise, use the Intel compiler to vectorize the loops that contain calls to math functions.

Also, have a look at the VML training materials, which contain high-level information about VML accuracy, performance, and API, as well as a VML-based benchmark. The presentation is at http://software.intel.com/en-us/articles/intel-mkl-vmlvsl-training-material

Quick experiments comparing VML calls with compiler-vectorized loops should also help you choose the approach that gives the best performance for your application.

Andrey

Youcef_K_ · ‎04-09-2013

@Andrey

Thanks for your help, that's exactly what I experimented with today: for my application, VML improves execution time more than CBLAS does. But VML doesn't contain some of the vector operations I need, like y = ax + y, y = ay, y = copy(x) and s = sum(y). Am I wrong?

Do you think IPP functions (Copy, Saxpy, Mul in volume 3: Small Matrices and Realistic Rendering) could give a time improvement for small vector sizes, as VML does?

Andrey_N_Intel · ‎04-09-2013

Hi Youcef,

I wonder whether the VML LinearFrac function, z = (A * x + B) / (C * y + D), with suitably chosen parameters, could be used for operations like y = A * y. An operation like y = A * x + y could be obtained with two calls, one to VML LinearFrac() and one to Add(). Use BLAS when the math in your app cannot be expressed as a composition of VML functions, or when the BLAS functions are faster. Yes, it makes sense to experiment with IPP functions for small vector sizes.
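To make the parameter choice concrete, here is a minimal sketch in plain C that models the LinearFrac formula and shows how B = 0, C = 0, D = 1 reduces it to a scaling. The function `linear_frac_ref` and the two wrappers are hypothetical stand-ins written for illustration; they do not call MKL, and the real VML routine's argument order should be taken from the MKL reference.

```c
#include <assert.h>
#include <stddef.h>

/* Reference model of z[i] = (A*x[i] + B) / (C*y[i] + D).
   Plain C stand-in for illustration; this is NOT the MKL routine. */
void linear_frac_ref(size_t n, const float *x, const float *y,
                     float A, float B, float C, float D, float *z)
{
    assert(x != NULL && y != NULL && z != NULL);
    for (size_t i = 0; i < n; ++i) {
        z[i] = (A * x[i] + B) / (C * y[i] + D);
    }
}

/* y = a*y: with B = 0, C = 0, D = 1 the denominator is 1 for finite inputs. */
void scale_via_linear_frac(size_t n, float a, float *y)
{
    linear_frac_ref(n, y, y, a, 0.0f, 0.0f, 1.0f, y);
}

/* y = a*x + y: one LinearFrac-style pass into tmp, then an element-wise add. */
void axpy_via_linear_frac(size_t n, float a, const float *x,
                          float *y, float *tmp)
{
    linear_frac_ref(n, x, x, a, 0.0f, 0.0f, 1.0f, tmp);
    for (size_t i = 0; i < n; ++i) {
        y[i] = tmp[i] + y[i];   /* stands in for the VML Add() step */
    }
}
```

Note that the composed y = a*x + y makes two passes over the data and needs a temporary buffer, so it is worth timing against a single fused loop.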

Andrey

TimP · ‎04-09-2013

I suppose a reason for not supporting these operations in VML is that they could be done more efficiently by compiler generated code or by BLAS function calls, according to the size.

For compiled code, you may wish to study the optimization pragmas and Extended Array Notation. Rather than combining performance library function calls to construct composite operations, I would recommend compiled code.
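As a rough illustration of the pragma route, here is a minimal sketch of s = sum(y) written as compiled code. It uses the OpenMP `simd` directive as a portable stand-in for the compiler-specific optimization pragmas; that substitution is an assumption of this sketch, and the name `sum_simd` is hypothetical. A compiler that does not recognize the directive simply ignores it and the loop still runs correctly.

```c
#include <assert.h>
#include <stddef.h>

/* Returns y[0] + y[1] + ... + y[n-1].
   The reduction clause lets the compiler reorder the additions, which is
   what permits vectorizing a floating-point sum; the rounding of the
   result may therefore differ slightly from a strictly sequential sum. */
float sum_simd(size_t n, const float *y)
{
    assert(y != NULL);
    float s = 0.0f;
    #pragma omp simd reduction(+:s)
    for (size_t i = 0; i < n; ++i) {
        s += y[i];
    }
    return s;
}
```

For a vector of 1024 elements, as in this thread, a loop like this avoids the function-call overhead of a library routine, though only a measurement can say which is faster.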