MKL optimization problem: VML functions (sequential and threaded)

zeusz4u · 02-01-2012

Hi everyone,

I'm having optimization problems with MKL. I'm not sure whether I'm doing something wrong, or whether MKL simply won't bring any benefit in my case.

I've made a prototype implementation of the Black-Scholes algorithm for evaluating option prices, using both standard C functions and MKL functions from the VML library. My problem is that the MKL implementation is much slower than the plain float implementation. I've tried both single-threaded and multi-threaded builds. Can someone please take a look and give me some advice or suggestions on what else I could try? According to the documentation this is a high-performance library. However, my results don't reflect this.

I've attached the code. Just uncomment the mkl_domain_set_num_threads() function. Also the makefile contains both single and multi threaded libraries. You just have to uncomment the corresponding lines.

Whenever I use Sequential linking:
icpc -c -w1 -O2 -xsse4.2 -DMKL_ILP64 -I. -I/opt/intel/composerxe/include -I/opt/intel/mkl/include -o Black76.o Black76.cpp
icpc -c -w1 -O2 -xsse4.2 -DMKL_ILP64 -I. -I/opt/intel/composerxe/include -I/opt/intel/mkl/include -o main.o main.cpp
icpc -L/opt/intel/mkl/lib/intel64 -L/opt/intel/lib/intel64 Black76.o main.o -o black76_intel -lrt -lmkl_intel_ilp64 -lmkl_sequential -lmkl_core -lm

I'm getting the following performance results:
Completed 1 passes in 0 : 001526118 seconds
Completed 2 passes in 0 : 000007518 seconds
Completed 3 passes in 0 : 000008536 seconds
Completed 10 passes in 0 : 000026468 seconds
Completed 100 passes in 0 : 000329301 seconds
Completed 1000 passes in 0 : 002591126 seconds
Completed 10000 passes in 0 : 014796280 seconds
Completed 100000 passes in 0 : 147133308 seconds
Completed 1000000 passes in 1 : 465677079 seconds
Completed 10000000 passes in 14 : 714433962 seconds

There is also something odd here: running 2 passes should not be quicker than running only one pass. There is a huge difference between the two, and the 3-pass result doesn't look realistic either. Even 100 passes run quicker than the first one. This shouldn't happen.

When I compile with multi-threading I use the following options:
icpc -L/opt/intel/mkl/lib/intel64 -L/opt/intel/lib/intel64 Black76.o main.o -o black76_intel -lrt -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm

I will have to make 16 calculations repeatedly, so I defined ARRAYSIZE=16, but I also tried increasing ARRAYSIZE to 16000 and enabling multi-threading; sequential was still faster than multi-threaded. I'd like to improve performance with 16 calculations.

Can someone help me?

Please advise,

Thank you,
Eduard.

Ilya_B_Intel · 02-01-2012

Hi Eduard,

Thanks for your question, the code, and the detailed description of your environment.

We have several comments on your code which may help to improve the performance of your Black-Scholes benchmark:

1) By default, Intel Math Kernel Library runs the High Accuracy versions of the Vector Math functions, while the compiler default is the Lower Accuracy versions.

If your application does not require this level of accuracy, you might want to relax it using vmlSetMode as shown below:

vmlSetMode(VML_LA); // use Lower Accuracy version of the functions

or even

vmlSetMode(VML_EP); // using Enhanced Performance version of the functions

This will help you to get additional performance benefit for math functions.

Also, the performance data and graphs available at

http://software.intel.com/sites/products/documentation/hpc/mkl/vml/functions/_performanceall.html

http://software.intel.com/sites/products/documentation/hpc/mkl/vml/functions/exp.html

and similar pages would be useful for getting an idea of the performance of the vector math functions.

2) Modern processors can execute multiplication and addition instructions in parallel, and the Intel compiler can take advantage of that through proper instruction scheduling.

So, you might want to try plain code instead of the vector Mul, Add, and Sqr functions. For example, please try this loop

for(j=0;j< numPasses;j++)
{
volat2_temp = volat2_temp*T;
Numerator = log_temp + volat2_temp;
}

instead of

vsMul(ARRAYSIZE, volat2_temp, T, volat2_temp);
//compute numerator = (log(S / X) + (v * v / 2) * T)
vsAdd(ARRAYSIZE, log_temp, volat2_temp, numerator);

You would also get better performance if you group as many such simple operations as possible into one loop, because the compiler will then have better instruction scheduling possibilities.

3) Intel MKL math functions are expected to be threaded for vector length 16K, and this should give you an additional performance benefit. Setting the number of threads with the Intel MKL service functions would probably be useful, as different functions are threaded differently at the same vector length. You might also want to try a different approach by integrating parallelization into your application (this can be done, for example, by using OpenMP* directives); in this case, please use the serial version of the Intel MKL math functions.

Also, the Intel MKL Manual suggests calling vector functions when the vector length is at least several dozen elements. For small vector lengths, the math functions available in the Intel C++ compiler would be a better choice.

4) You have some room for simplifying the Black-Scholes formula (even more so if you consider that 2 of the 5 arguments are constant).

5) During the first call to Intel MKL functions, additional initialization is performed; that's why the results for 2 passes are better than for 1 pass.
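
As a minimal sketch of point 2, the element-wise multiply and add can be fused into one loop. The function name `fused_numerator` and the use of `std::vector` are illustrative assumptions, not taken from the attached code. Unlike the vsMul call, this version leaves volat2_temp unmodified.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the attached code): computes
// numerator[i] = log_temp[i] + volat2_temp[i] * T[i] in a single pass,
// so the compiler can schedule the multiply and the add together.
void fused_numerator(const std::vector<float>& log_temp,
                     const std::vector<float>& volat2_temp,
                     const std::vector<float>& T,
                     std::vector<float>& numerator)
{
    const std::size_t n = log_temp.size();
    assert(volat2_temp.size() == n && T.size() == n);
    numerator.resize(n);
    for (std::size_t i = 0; i < n; ++i) {
        numerator[i] = log_temp[i] + volat2_temp[i] * T[i];
    }
}
```

Further simple operations on the same elements can be added to the loop body rather than being run as separate passes over the arrays.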

It is also worth noting that use of capabilities of Intel Compiler (such as vectorization, parallelization, architecture specific optimizations) in addition to features of Intel MKL would open more opportunities for performance gain on multi-core processors.

Please let us know if you have more questions or comments on the optimization approaches to the Black-Scholes benchmark, and we would gladly help.

zeusz4u · 02-02-2012

Hello Ilya,

And I'd like to thank you for the detailed explanation and exemplification.

1) I will try setting the different accuracy levels. I'm really curious about the accuracy of the results, as well as the execution time.

2) I think at point 2 you meant:
for(j=0;j< ARRAYSIZE;j++)
{
volat2_temp[j] = volat2_temp[j]*T[j];
numerator[j] = log_temp[j] + volat2_temp[j];
}

This is what I wanted to try next: to use only the MKL exp, log, sqrt, and cnd, and use regular arithmetic operators for +, -, *, and /. Maybe Div could also be used from the MKL library.

Basically I want to measure how long it takes to make those 16 calculations when I do 1 pass, 10 passes, 100 passes, and so on, to see the real performance of the calculation in these cases.

3) Does it make sense to use #pragma omp parallel sections to indicate a parallel region (4 threads, each working on 4-element arrays)? I've had another implementation, using the math.h functions. I tried using OpenMP there, but the result wasn't good at all. On the other hand, I've seen examples of using parallel sections for the QuickSort algorithm, so it should be something similar. Maybe I could use #pragma omp parallel for for the above example of element-by-element multiplications and additions.

Also, does it help to turn hyper-threading off and use a real-time kernel instead of the regular one? This is what I also want to try next.

One more remark here:
4) I've used constants in this simulation, but I'm not sure if it's gonna be the same in a real-time environment. Actually it's inaccurate, because calculations should be made for Options having the same Expiry date, so vector T should be constant. I will check the other parameters as well, and try to get some real data for the simulation.

I really appreciate your response, as I'm new to MKL programming and to Intel CPU programming as well. So everything you said is a great help to me.

One more question I'd like to ask (I don't know if this is the right place to do so, but it's worth a try): I've found a presentation on the internet by Heinz Bast, Technical Consulting Engineer, Software Development Products, Intel Corporation, entitled A Case Study: Using Intel Parallel Studio XE to Optimize Black Scholes Calculation. The PDF file mentions that the source code can be freely obtained upon request from the presenter. Can I download it from somewhere? It would be a good reference to see a highly optimized Black-Scholes algorithm.

Thank you, and I'm looking forward to getting a response to the above questions (or at least some of them).

Eduard.

Ilya_B_Intel · 02-02-2012

I would say that for most platforms threading the BS formula with vector length 16 is not really reasonable: the threading overhead will outweigh all benefits. If you need larger computations it will make sense (and either #pragma omp parallel sections or #pragma omp parallel for can be used). On the MKL side, no VML function is threaded for vector lengths less than 100.
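
A rough sketch of the #pragma omp parallel for option with a size cutoff, so that short vectors such as 16 elements stay on one thread. The cutoff value and the function name `exp_all` are hypothetical and would need tuning by measurement. The pragma takes effect only when the compiler's OpenMP option is enabled; otherwise the loop simply runs serially.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical cutoff: below this length the loop stays single-threaded.
// The value is a placeholder, not a measured or documented threshold.
const long kParallelThreshold = 16384;

// Applies exp() to every element; threads are used only for long vectors.
void exp_all(const std::vector<double>& in, std::vector<double>& out)
{
    assert(in.size() == out.size());
    const long n = static_cast<long>(in.size());
    #pragma omp parallel for if(n >= kParallelThreshold)
    for (long i = 0; i < n; ++i) {
        out[i] = std::exp(in[i]);
    }
}
```

With application-level threading like this, any MKL calls inside the loop body would go to the serial MKL library, as noted earlier in the thread.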

I saw some visible benefits from having hyper-threading turned on in the Black-Scholes benchmark, though, again, the vector lengths were much higher than 16.

You can also consider using vsInvSqrt instead of vsSqrt + vsDiv, try to limit the number of divisions (the BS formula with 2 fixed arguments and 3 array arguments can be done with only 1 vsDiv), and consider using vsErf instead of vsCdfNorm (because of some mathematical properties of those functions, it is sometimes quicker to do Erf + scaling than CdfNorm).
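
To illustrate the Erf + scaling idea and the hoisting of constant arguments, here is a scalar sketch in standard C++ (no MKL calls). It assumes the Black-76 call formula implied by the file names and the numerator comment, and it assumes that T and R are the two constant arguments. The function names are hypothetical, and this version does not attempt the single-division form.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Standard normal CDF expressed through erf:
// Phi(x) = 0.5 * (1 + erf(x / sqrt(2))).
inline double cdf_norm_via_erf(double x)
{
    const double inv_sqrt2 = 0.70710678118654752440;
    return 0.5 * (1.0 + std::erf(x * inv_sqrt2));
}

// Hypothetical Black-76 call pricer. S, X and V are arrays; T and R are
// assumed constant for the whole batch, so sqrt(T) and exp(-R*T) are
// computed once outside the loop.
void black76_call(const std::vector<double>& S, const std::vector<double>& X,
                  const std::vector<double>& V, double T, double R,
                  std::vector<double>& price)
{
    const std::size_t n = S.size();
    assert(X.size() == n && V.size() == n);
    price.resize(n);
    const double sqrt_T = std::sqrt(T);        // hoisted constant
    const double discount = std::exp(-R * T);  // hoisted constant
    for (std::size_t i = 0; i < n; ++i) {
        const double vol_sqrt_T = V[i] * sqrt_T;
        const double d1 = (std::log(S[i] / X[i]) + 0.5 * vol_sqrt_T * vol_sqrt_T)
                          / vol_sqrt_T;
        const double d2 = d1 - vol_sqrt_T;
        price[i] = discount * (S[i] * cdf_norm_via_erf(d1)
                               - X[i] * cdf_norm_via_erf(d2));
    }
}
```

In a vectorized version, the erf step would presumably be one vsErf call on the pre-scaled d1 and d2 arrays, followed by the 0.5 * (1 + ...) scaling in a plain loop.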

Andrey_N_Intel · 02-02-2012

Hi Eduard,

In addition to Ilya's answer I'd suggest having a look at the VML & VSL training materials available at http://software.intel.com/en-us/articles/intel-mkl-vmlvsl-training-material/.
This set of slides describes features of the Vector Math functions and the statistical functionality available in Intel Math Kernel Library. Slides 28-30 contain a description of optimization approaches to the Black-Scholes formula and related performance data.
Also, at some point in the future we are thinking about posting white papers which, in particular, would demonstrate Intel SW based optimization approaches to the Black-Scholes and Monte Carlo versions of the European option pricing problem. Code samples would be part of those publications.

Please feel free to ask more questions on the Vector Math and Stat features of Intel MKL, and we will help.

Thanks,
Andrey

dmitry_k · 02-03-2012

Ilya, I have a similar problem. I see a big difference between the sines computed by vdSin and by sin() inside a loop. I use MS VS 2005 with Intel Composer XE 2011 Update 6. Could you please tell me the compiler switch for the different accuracies of VML functions?

Thanks,

Dmitry.

Ilya_B_Intel · 02-03-2012

Dmitry,

On the MKL side, you can control accuracy with a special function call:

vmlSetMode([VML_HA|VML_LA|VML_EP])

On the Intel Compiler vectorized math functions side, you can control accuracy with switches:

-fimf-precision=[high|medium|low]

See the compiler documentation for more details:

http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/cpp/lin/copts/common_options/option_fimf_precision.htm

dmitry_k · 02-04-2012

Ilya, thanks a lot.

zeusz4u · 02-06-2012

Hello Ilya and Andrey,

I'd like to thank you for the detailed instructions. Using Erf instead of CdfNorm, replacing Add, Sub, Mul and Div with for loops, and setting the accuracy to LA considerably improved the program's execution time.

You have been a great help.

I'm now looking at the ArBB implementation; there is a Black-Scholes example included with the installation kit. I hope this one will be even better than MKL.

Eduard.

zeusz4u · 02-09-2012

I have another question/concern, with which you may be able to help me out.
It's still Black-Scholes related, but not MKL.

I tried a different approach to the problem, and incorporated both float and double tests.

Please check my code whenever you may have some free time. In this version I'm still getting some odd results.

I'm now using SPAN data samples from Chicago Mercantile Exchange, however I seem to have the same problem with execution times. Which one is to be trusted at this time? Here is the output:

------------ Running Black76 Software benchmark ------------
RUNNING FLOAT TEST
Completed 1 passes in 0 : 000009731 seconds
Completed 2 passes in 0 : 000001850 seconds
Completed 3 passes in 0 : 000002363 seconds
Completed 4 passes in 0 : 000003015 seconds
Completed 5 passes in 0 : 000003615 seconds
Completed 10 passes in 0 : 000006321 seconds
Completed 20 passes in 0 : 000013203 seconds
Completed 50 passes in 0 : 000029405 seconds
Completed 100 passes in 0 : 000058387 seconds
Completed 1000 passes in 0 : 000595579 seconds
Completed 10000 passes in 0 : 005931292 seconds
Completed 100000 passes in 0 : 042001457 seconds
RUNNING DOUBLE TEST
Completed 1 passes in 0 : 000012223 seconds
Completed 2 passes in 0 : 000004050 seconds
Completed 3 passes in 0 : 000005351 seconds
Completed 4 passes in 0 : 000006255 seconds
Completed 5 passes in 0 : 000007663 seconds
Completed 10 passes in 0 : 000014439 seconds
Completed 20 passes in 0 : 000027367 seconds
Completed 50 passes in 0 : 000067247 seconds
Completed 100 passes in 0 : 000133634 seconds
Completed 1000 passes in 0 : 001369065 seconds
Completed 10000 passes in 0 : 013721015 seconds
Completed 100000 passes in 0 : 086685354 seconds

My concern is the first pass, where again I'm getting a much higher execution time than later on. And I'm not using any MKL functions this time. Is it still necessary for the Intel compiler to do some initialization at the first call of math.h functions? Or is it related to the fact that the sample data is the same in later passes? Can we trust these results? Please advise. I attached both the code and the Makefile. It outputs the results into a .csv file, and the execution time is also displayed in the console.

I tried both -O2 and -O3 compiler options, the results are pretty much the same.

Ilya_B_Intel · 02-10-2012

Eduard,

You are looking at the effect of a cold cache.

I see that you are using different output arrays in your "warming" run and "real" run.

float test[ARRAYSIZE];

...

test = compute_Black76_float('C', S, X, T, R, V);

float result[ARRAYSIZE];

...

result = compute_Black76_float('C', S, X, T, R, V);

That results in the following: during the first function call after the warming run your input arrays are in cache, but your real output array is not in cache yet.

The next time you run the same function (and it does not matter how many passes are requested), the result array is located in exactly the same place on the stack, which turns out to be already in cache.

Which performance result is more relevant in your case depends on the final application usage model (whether the result and input arrays will be in cache before the function call or not). It is definitely worth trying to keep them in cache.
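
A minimal timing sketch of that idea: run one untimed warm-up pass that writes into the same output buffer that the timed passes use. The `kernel` function is a hypothetical stand-in for compute_Black76_float, and `std::chrono::steady_clock` is an assumption here, not necessarily the timer used in the attached code.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in kernel: any function that fills `out` from `in`.
void kernel(const std::vector<float>& in, std::vector<float>& out)
{
    assert(in.size() == out.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = std::sqrt(in[i]);
    }
}

// Runs one untimed warm-up pass into the SAME output buffer, then times
// `passes` further passes and returns the elapsed time in seconds.
double time_passes(const std::vector<float>& in, std::vector<float>& out,
                   int passes)
{
    kernel(in, out);  // warm-up: touches both the input and the output buffer
    const auto start = std::chrono::steady_clock::now();
    for (int p = 0; p < passes; ++p) {
        kernel(in, out);
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

If the real application will see cold data on every call, dropping the warm-up pass (or timing it separately) gives the more representative number.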

Ilya

zeusz4u · 02-10-2012

Thank you Ilya.

So no matter what input data I use (even if I use a different set of input data in each run), should I use the result vector in the warming run as well?

Can I just initialize each element with value 0 for example?

I made the warming run with a different vector because I thought the compute_Black76_float() function might need some initialization, so that it would be copied onto the stack and remain there throughout the execution of the program, but it seems I was wrong. So in order to get real-time measurements, I understand that the warming run should use the result vector, or at least the result vector needs to be initialized with some value to keep it on the stack before the real-time measurements begin.

SergeyKostrov · 02-13-2012

>>...both using standard C functions...

Your C++ prototype could be improved if C++ templates are used. You're duplicating code for the
'float' and 'double' data types. What if some time later you need to do calculations for a 'long double' data type?

Please take a look at a prototype of the Black-Scholes Algorithm with C++ templates:

...
template < class T > class TBlackScholes
{
public:
TBlackScholes( void )
{
Init();
};
virtual ~TBlackScholes( void )
{
};

virtual void RunTest( int iNumPasses )
{
//...
};

private:
void Init( void )
{
tPI = ( T )3.14159265358979323846;

tA1 = ( T ) 0.31938153;
tA2 = ( T )-0.356563782;
tA3 = ( T ) 1.781477937;
tA4 = ( T )-1.821255978;
tA5 = ( T ) 1.330274429;
tANeeded = ( T )0.3989423;
tKNeeded = ( T )0.2316419;
};

T Compute( char chFlag, T tS, T tX, T tT, T tR, T tV )
{
//...
};

T ComputeCND( T tX )
{
//...
};

private:
T tPI;

T tA1;
T tA2;
T tA3;
T tA4;
T tA5;
T tANeeded;
T tKNeeded;
};
...

int main( void )
{
...
// Test for 'float' datatype
TBlackScholes< float > fBS;

fBS.RunTest( 1 );
fBS.RunTest( 10 );
fBS.RunTest( 100 );
fBS.RunTest( 1000 );
fBS.RunTest( 10000 );

// Test for 'double' datatype
TBlackScholes< double > dBS;

dBS.RunTest( 1 );
dBS.RunTest( 10 );
dBS.RunTest( 100 );
dBS.RunTest( 1000 );
dBS.RunTest( 10000 );
...
}

Best regards,
Sergey