I'm having optimization problems with MKL. I'm not sure whether I'm doing something wrong, or whether MKL simply offers no benefit in my case.

I've made a prototype implementation of the Black-Scholes algorithm for evaluating option prices, both with standard C functions and with MKL functions from the VML library. My problem is that the MKL implementation is much slower than the plain float implementation. I've tried both single-threaded and multi-threaded builds. Can someone please take a look and suggest what else I could try? According to the documentation this is a high-performance library, but my results don't reflect this.

I've attached the code. Just uncomment the mkl_domain_set_num_threads() function. Also the makefile contains both single and multi threaded libraries. You just have to uncomment the corresponding lines.

Whenever I use Sequential linking:

**icpc -c -w1 -O2 -xsse4.2 -DMKL_ILP64 -I. -I/opt/intel/composerxe/include -I/opt/intel/mkl/include -o Black76.o Black76.cpp**

icpc -c -w1 -O2 -xsse4.2 -DMKL_ILP64 -I. -I/opt/intel/composerxe/include -I/opt/intel/mkl/include -o main.o main.cpp

icpc -L/opt/intel/mkl/lib/intel64 -L/opt/intel/lib/intel64 Black76.o main.o -o black76_intel -lrt -lmkl_intel_ilp64 -lmkl_sequential -lmkl_core -lm


I'm getting the following performance results:

Completed 1 passes in 0 : 001526118 seconds

Completed 2 passes in 0 : 000007518 seconds

Completed 3 passes in 0 : 000008536 seconds

Completed 10 passes in 0 : 000026468 seconds

Completed 100 passes in 0 : 000329301 seconds

Completed 1000 passes in 0 : 002591126 seconds

Completed 10000 passes in 0 : 014796280 seconds

Completed 100000 passes in 0 : 147133308 seconds

Completed 1000000 passes in 1 : 465677079 seconds

Completed 10000000 passes in 14 : 714433962 seconds

Something is also odd here: running 2 passes should not be quicker than running only one pass. There is a huge difference between the two, and the time for 3 passes doesn't look realistic either. Even 100 passes complete faster than the first single pass. This shouldn't happen.

When I compile with multi-threading I use the following options:

**icpc -L/opt/intel/mkl/lib/intel64 -L/opt/intel/lib/intel64 Black76.o main.o -o black76_intel -lrt -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm**

I will have to make 16 calculations repeatedly, so I defined ARRAYSIZE=16. I also tried increasing ARRAYSIZE to 16000 and enabling multi-threading, but sequential was still faster than multi-threaded. I'd like to improve performance with 16 calculations.

Can someone help me?

Please advise,

Thank you,

Eduard.

Hi Eduard,

Thanks for your question, the code, and the detailed description of your environment.

We have several comments on your code which may help to improve the performance of your Black-Scholes benchmark:

1) By default, Intel Math Kernel Library runs the High Accuracy versions of the Vector Math functions, while the compiler default is the Lower Accuracy versions.

If your application does not require this level of accuracy, you might want to relax it using vmlSetMode as shown below:

vmlSetMode(VML_LA); // use Lower Accuracy version of the functions

or even

vmlSetMode(VML_EP); // using Enhanced Performance version of the functions

This will help you to get additional performance benefit for math functions.

Also, the performance data and graphs available at

http://software.intel.com/sites/products/documentation/hpc/mkl/vml/functions/_performanceall.html

http://software.intel.com/sites/products/documentation/hpc/mkl/vml/functions/exp.html

and similar pages should give you an idea of the performance of the vector math functions.

2) Modern processors can execute multiplication and addition instructions in parallel, and the Intel compiler can take advantage of that by properly scheduling the instructions.

So, you might want to try using this piece of the code instead of vector Mul, Add, and Sqr. For example, please try this loop

    for(j=0;j<numPasses;j++)
    {
        volat2_temp[j] = volat2_temp[j] * T[j];
        Numerator[j] = log_temp[j] + volat2_temp[j];
    }

instead of

    vsMul(ARRAYSIZE, volat2_temp, T, volat2_temp);

    //compute numerator = (log(S / X) + (v * v / 2) * T)
    vsAdd(ARRAYSIZE, log_temp, volat2_temp, numerator);

You would also get better performance if you group as many of these simple operations into one loop as possible, because the compiler will have better instruction scheduling opportunities.

3) Intel MKL math functions are expected to be threaded for vector length 16K, and this should give you an additional performance benefit. Setting the number of threads with the Intel MKL service functions would probably be useful, as different functions are threaded differently on the same vector length. You might also want to try a different approach and integrate parallelization into your application (this can be done, for example, with OpenMP* directives); in this case, please use the serial version of the Intel MKL math functions.

Also, the Intel MKL Manual suggests calling vector functions when the vector length is at least several dozen elements. For small vector lengths, the math functions available in the Intel C++ compiler would be a better choice.

4) You have some room for simplifying the Black-Scholes formula (even more so if you consider that 2 of the 5 arguments are constant).

5) During the first call to Intel MKL functions, additional initialization is performed; that's why the results for 2 passes look better than for 1 pass.

It is also worth noting that using the capabilities of the Intel Compiler (such as vectorization, parallelization, architecture-specific optimizations) in addition to the features of Intel MKL would open up more opportunities for performance gains on multi-core processors.

Please let us know if you have more questions or comments on the optimization approaches to the Black-Scholes benchmark, and we will gladly help.

And I'd like to thank you for the detailed explanation and exemplification.

1) I will try setting the compiler to different accuracy levels. I'm really curious about the accuracy of the results, as well as the execution time.

2) I think at point 2 you meant ARRAYLENGTH as the loop bound:

    for(j=0;j<ARRAYLENGTH;j++)
    {
        volat2_temp[j] = volat2_temp[j] * T[j];
        Numerator[j] = log_temp[j] + volat2_temp[j];
    }

This is what I wanted to try next: use only the MKL exp, log, sqrt, and cnd functions, and use regular arithmetic operators for +, -, *, and /. Maybe Div from the MKL library could also be used.

Basically I want to measure how long it takes to make those 16 calculations when I do 1 pass, 10 passes, or 100 passes, and see the real performance of the calculation in these cases.

3) Does it make sense to use **#pragma omp parallel sections** to indicate a parallel region (4 threads, each working on 4-element arrays)? I had another implementation using the math.h functions. I tried using OpenMP there, but the result wasn't good at all. On the other hand, I've seen examples of using parallel sections for the QuickSort algorithm, so it should be something similar. Maybe I could use #pragma omp parallel for for the above example of element-by-element multiplications and additions.

Also, does it help to turn hyper-threading off and use a real-time kernel instead of the regular one? This is what I also want to try next.

One more remark here:

4) I've used constants in this simulation, but I'm not sure if it's gonna be the same in a real-time environment. Actually it's inaccurate, because calculations should be made for Options having the same Expiry date, so vector T should be constant. I will check the other parameters as well, and try to get some real data for the simulation.
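
If T really is shared by the whole batch, everything that depends only on T can be computed once outside the loop. The following is a minimal sketch of that idea in plain C++, not the code attached to this thread; the names `D1D2` and `compute_d1_d2` are made up for illustration, and it assumes the usual Black-76 definitions of d1 and d2.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical container for the two intermediate terms of the formula.
struct D1D2 {
    std::vector<double> d1;
    std::vector<double> d2;
};

// Computes d1 and d2 for a batch of options that share one time to expiry T.
// F = futures prices, X = strikes, v = volatilities (all the same length).
D1D2 compute_d1_d2(const std::vector<double>& F,
                   const std::vector<double>& X,
                   const std::vector<double>& v,
                   double T)
{
    // Loop invariants: evaluated once instead of once per element.
    const double sqrtT = std::sqrt(T);
    const double halfT = 0.5 * T;

    const std::size_t n = F.size();
    D1D2 out{std::vector<double>(n), std::vector<double>(n)};
    for (std::size_t i = 0; i < n; ++i) {
        const double volSqrtT = v[i] * sqrtT;
        out.d1[i] = (std::log(F[i] / X[i]) + v[i] * v[i] * halfT) / volSqrtT;
        out.d2[i] = out.d1[i] - volSqrtT;
    }
    return out;
}
```

With a shared interest rate, the discount factor exp(-r * T) could be hoisted out of the pricing loop in the same way.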

I really appreciate your response, as I'm new to MKL programming and to Intel CPU programming as well. So all you said is a great help to me.

One more question I'd like to ask you (I don't know if this is the right place to do so, but it's worth a try): I've found a presentation on the internet by Heinz Bast, Technical Consulting Engineer, Software Development Products, Intel Corporation, entitled **A Case Study: Using Intel Parallel Studio XE to Optimize Black Scholes Calculation**. The PDF file mentions that the source code can be freely obtained upon request from the presenter. Can I download it from somewhere? It would be a good reference to see a highly optimized Black Scholes algorithm.

Thank you, and I'm looking forward to getting a response to the above questions (or at least some of them).

Eduard.

**#pragma omp parallel sections** or **#pragma omp parallel for** can be used fine). On the MKL side, no VML function is threaded on vector lengths less than 100.
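
As a rough illustration of the `#pragma omp parallel for` variant, here is a sketch of the element-wise multiply-and-add step written as one loop. The function name `numerator_parallel` and the use of `std::vector` are assumptions made for this example. OpenMP has to be enabled at compile time (for example `-fopenmp` with GCC); otherwise the pragma is ignored and the loop runs serially.

```cpp
#include <vector>

// numerator[i] = log_temp[i] + volat2[i] * T[i], with the loop marked for OpenMP.
// All four vectors are assumed to have the same length.
void numerator_parallel(const std::vector<float>& log_temp,
                        const std::vector<float>& volat2,
                        const std::vector<float>& T,
                        std::vector<float>& numerator)
{
    // A signed loop counter keeps the loop valid for older OpenMP versions.
    const long n = static_cast<long>(numerator.size());
    #pragma omp parallel for
    for (long i = 0; i < n; ++i)
        numerator[i] = log_temp[i] + volat2[i] * T[i];
}
```

For arrays as short as 16 elements, the cost of starting and synchronizing threads is likely to outweigh the arithmetic, so a serial loop is the more plausible choice at that size.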

Hi Eduard,

In addition to Ilya's answer, I'd suggest having a look at the VML & VSL training materials available at http://software.intel.com/en-us/articles/intel-mkl-vmlvsl-training-material/.

This set of slides describes the features of the Vector Math functions and the statistical functionality available in Intel Math Kernel Library. Slides 28-30 describe optimization approaches to the Black-Scholes formula and the related performance data.

Also, at some point in the future we are thinking about posting white papers which, in particular, would demonstrate Intel software based optimization approaches to the Black-Scholes and Monte Carlo versions of the European option pricing problem. Code samples would be part of those publications.

Please feel free to ask more questions on the Vector Math and Stat features of Intel MKL, and we will help.

Thanks,

Andrey

Ilya, I have a similar problem. I see a big difference between sines computed by vdSin and by sin() inside a loop. I use MS VS 2005 with Intel Composer XE 2011 Update 6. Could you please tell me the compiler switch for the different accuracy levels of the VML functions?

Thanks,

Dmitry.

On the MKL side, you can control accuracy with a special function call:

I'd like to thank you for the detailed instructions. Using Erf instead of CdfNorm, replacing Add, Sub, Mul and Div with for loops, and setting the accuracy to LA have considerably improved program execution time.
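
For readers who want to see the shape of that change, here is a small sketch in plain C++. It is not the code from this thread: it uses `std::erf` rather than the VML functions, the names `cnd_erf` and `black76_call` are made up for the example, and it assumes the standard Black-76 call formula. The normal CDF is expressed through erf, and all the simple arithmetic sits in one loop.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Standard normal CDF via erf: N(x) = 0.5 * (1 + erf(x / sqrt(2))).
inline double cnd_erf(double x)
{
    const double invSqrt2 = 0.70710678118654752440;
    return 0.5 * (1.0 + std::erf(x * invSqrt2));
}

// Black-76 call prices for a batch; all vectors are assumed to have equal length.
// F = futures prices, X = strikes, T = times to expiry, v = volatilities, r = rate.
void black76_call(const std::vector<double>& F,
                  const std::vector<double>& X,
                  const std::vector<double>& T,
                  const std::vector<double>& v,
                  double r,
                  std::vector<double>& price)
{
    for (std::size_t i = 0; i < price.size(); ++i) {
        // Multiplies, adds and divides stay in one loop body so the compiler
        // can schedule them together.
        const double volSqrtT = v[i] * std::sqrt(T[i]);
        const double d1 = (std::log(F[i] / X[i]) + 0.5 * v[i] * v[i] * T[i]) / volSqrtT;
        const double d2 = d1 - volSqrtT;
        price[i] = std::exp(-r * T[i]) * (F[i] * cnd_erf(d1) - X[i] * cnd_erf(d2));
    }
}
```

In an MKL-based version the transcendental calls would presumably be the vector functions operating on whole arrays, with only the cheap arithmetic left in loops like this one.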

You have been a great help.

I'm now looking at the ArBB implementation; there is a Black-Scholes example included with the installation kit. I hope this one will be even better than MKL.

Eduard.

It's still Black Scholes related, but not MKL.

I tried a different approach to the problem, and incorporated both float and double tests.

Please check my code whenever you may have some free time. In this version I'm still getting some odd results.

I'm now using SPAN data samples from Chicago Mercantile Exchange, however I seem to have the same problem with execution times. Which one is to be trusted at this time? Here is the output:

------------ Running Black76 Software benchmark ------------

RUNNING FLOAT TEST

Completed 1 passes in 0 : 000009731 seconds

Completed 2 passes in 0 : 000001850 seconds

Completed 3 passes in 0 : 000002363 seconds

Completed 4 passes in 0 : 000003015 seconds

Completed 5 passes in 0 : 000003615 seconds

Completed 10 passes in 0 : 000006321 seconds

Completed 20 passes in 0 : 000013203 seconds

Completed 50 passes in 0 : 000029405 seconds

Completed 100 passes in 0 : 000058387 seconds

Completed 1000 passes in 0 : 000595579 seconds

Completed 10000 passes in 0 : 005931292 seconds

Completed 100000 passes in 0 : 042001457 seconds

RUNNING DOUBLE TEST

Completed 1 passes in 0 : 000012223 seconds

Completed 2 passes in 0 : 000004050 seconds

Completed 3 passes in 0 : 000005351 seconds

Completed 4 passes in 0 : 000006255 seconds

Completed 5 passes in 0 : 000007663 seconds

Completed 10 passes in 0 : 000014439 seconds

Completed 20 passes in 0 : 000027367 seconds

Completed 50 passes in 0 : 000067247 seconds

Completed 100 passes in 0 : 000133634 seconds

Completed 1000 passes in 0 : 001369065 seconds

Completed 10000 passes in 0 : 013721015 seconds

Completed 100000 passes in 0 : 086685354 seconds

My concern is the first pass, where I'm again getting a much higher execution time than later on. And I'm not using any MKL functions this time. Does the Intel compiler still need to do some initialization at the first call of math.h functions? Or is it related to the fact that the sample data is the same in later passes? Can we trust these results? Please advise. I attached both the code and the Makefile. It writes the results to a .csv file, and the execution time is also displayed in the console.

I tried both -O2 and -O3 compiler options, the results are pretty much the same.

**test**[ARRAYSIZE];

**test**= compute_Black76_float('C', S

**result**[ARRAYSIZE];

**result**= compute_Black76_float('C', S

**result** array is located in exactly the same place on the stack, and that turns out to be already in cache.

So no matter what input data I use (even if I use a different set of input data in each run), should I use the result vector in the warm-up run as well?

Can I just initialize each element with the value 0, for example?

I made the warm-up run with a different vector. I thought the compute_Black76_float() function might need some initialization, so that it would be copied onto the stack once and remain there for the rest of the program's execution, but it seems I was wrong. So, to get realistic measurements, I understand that the warm-up run should use the result vector, or the vector at least needs to be initialized with some value so that it is already in place before the timed measurements begin.
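
One way to set up such a measurement, as a sketch only: run one untimed pass that reads the same input and writes the same result buffer, then time the remaining passes. The `kernel` function below is a placeholder workload standing in for the pricing routine, and `time_passes` is a made-up name; neither comes from the attached code.

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

// Placeholder workload (hypothetical); the real pricing kernel would go here.
void kernel(const std::vector<float>& in, std::vector<float>& out)
{
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = std::exp(-in[i]) * std::sqrt(in[i] + 1.0f);
}

// Returns the wall-clock time in seconds for `passes` passes over the data.
double time_passes(const std::vector<float>& in, std::vector<float>& out, int passes)
{
    // Warm-up pass, not timed: it touches the same input and output buffers
    // that the timed passes use, and triggers any one-time initialization.
    kernel(in, out);

    const auto start = std::chrono::steady_clock::now();
    for (int p = 0; p < passes; ++p)
        kernel(in, out);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Whether a warmed-up number is the right one to report depends on the deployment: if the real application prices 16 options only occasionally, the cold first-pass time may be closer to what it will actually see.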

**>>...both using standard C functions...**

Your **C++** prototype could be improved if **C++ templates** are used. You're duplicating code for the 'float' and 'double' data types. What if some time later you need to do calculations for a 'long double' data type?

Please take a look at a prototype of the **Black-Scholes** Algorithm with **C++ templates**:

    ...
    template < class T > class TBlackScholes
    {
    public:
        TBlackScholes( void )
        {
            Init();
        };
        virtual ~TBlackScholes( void )
        {
        };
        virtual void RunTest( int iNumPasses )
        {
            //...
        };
    private:
        void Init( void )
        {
            tPI = ( T )3.14159265358979323846;
            tA1 = ( T ) 0.31938153;
            tA2 = ( T )-0.356563782;
            tA3 = ( T ) 1.781477937;
            tA4 = ( T )-1.821255978;
            tA5 = ( T ) 1.330274429;
            tANeeded = ( T )0.3989423;
            tKNeeded = ( T )0.2316419;
        };
        T Compute( char chFlag, T tS, T tX, T tT, T tR, T tV )
        {
            //...
        };
        T ComputeCND( T tX )
        {
            //...
        };
    private:
        T tPI;
        T tA1;
        T tA2;
        T tA3;
        T tA4;
        T tA5;
        T tANeeded;
        T tKNeeded;
    };
    ...
    int main( void )
    {
        ...
        // Test for 'float' datatype
        TBlackScholes< float > fBS;
        fBS.RunTest( 1 );
        fBS.RunTest( 10 );
        fBS.RunTest( 100 );
        fBS.RunTest( 1000 );
        fBS.RunTest( 10000 );
        // Test for 'double' datatype
        TBlackScholes< double > dBS;
        dBS.RunTest( 1 );
        dBS.RunTest( 10 );
        dBS.RunTest( 100 );
        dBS.RunTest( 1000 );
        dBS.RunTest( 10000 );
        ...
    }
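
The body of ComputeCND is elided above. As an illustration of how one template can serve both data types, here is a sketch of a free function template that uses the same constants; they match the widely used five-term polynomial approximation of the normal CDF (Abramowitz and Stegun 26.2.17). The name `cnd_poly` is hypothetical, and this is only one plausible way to fill in the elided body.

```cpp
#include <cmath>

// Polynomial approximation of the standard normal CDF, generic over T.
template <class T>
T cnd_poly(T x)
{
    const T a1 = T(0.31938153);
    const T a2 = T(-0.356563782);
    const T a3 = T(1.781477937);
    const T a4 = T(-1.821255978);
    const T a5 = T(1.330274429);
    const T invSqrt2Pi = T(0.3989423);  // 1 / sqrt(2 * pi), as in tANeeded
    const T kCoef = T(0.2316419);       // as in tKNeeded

    const T ax = std::fabs(x);
    const T k = T(1) / (T(1) + kCoef * ax);
    // Horner form of a1*k + a2*k^2 + a3*k^3 + a4*k^4 + a5*k^5
    const T poly = k * (a1 + k * (a2 + k * (a3 + k * (a4 + k * a5))));
    const T upper = T(1) - invSqrt2Pi * std::exp(-ax * ax / T(2)) * poly;
    // The polynomial is valid for x >= 0; use symmetry for negative x.
    return x >= T(0) ? upper : T(1) - upper;
}
```

Calling `cnd_poly<float>(x)` or `cnd_poly<double>(x)` instantiates the same code for each type, which is the duplication the template approach is meant to remove.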

Best regards,

Sergey
