Thanks for your question, the code, and the detailed description of your environment.
We have several comments on your code which may help to improve the performance of your Black-Scholes benchmark:
1) By default, Intel Math Kernel Library runs the High Accuracy version of the Vector Math functions, while the compiler default is the Lower Accuracy version.
If your application does not require this level of accuracy, you might want to relax it using vmlSetMode, as shown below:
vmlSetMode(VML_LA); // use Lower Accuracy version of the functions
vmlSetMode(VML_EP); // or: use Enhanced Performance version of the functions
This will give you an additional performance benefit for the math functions.
Also, performance data and graphs available at
etc would be useful to have an idea about performance of vector math functions
2) Modern processors can execute multiplication and addition instructions in parallel, and the Intel compiler can take advantage of that through proper scheduling of the instructions.
So, you might want to try using this piece of the code instead of vector Mul, Add, and Sqr. For example, please try this loop
volat2_temp, T, volat2_temp);
//compute numerator = (log(S / X) + (v * v / 2) * T)
vsAdd(ARRAYSIZE, log_temp, volat2_temp, numerator);
You would also get better performance if you group as many of these simple operations as possible into one loop, because the compiler then has more opportunities for instruction scheduling.
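As a rough illustration of point 2, the numerator from the comment above could be computed in one fused loop instead of through separate vector Mul, Sqr, and Add calls. This is only a sketch: the function name fused_numerator and the array parameter names are assumptions and are not taken from the original benchmark.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical helper (not from the original benchmark):
// numerator[i] = log(S[i] / X[i]) + (v[i] * v[i] / 2) * T[i]
// All operations sit in one loop, so the compiler can vectorize it and
// schedule the multiplications and additions together.
void fused_numerator(std::size_t n, const float* S, const float* X,
                     const float* v, const float* T, float* numerator) {
    for (std::size_t i = 0; i < n; ++i) {
        numerator[i] = std::log(S[i] / X[i]) + 0.5f * v[i] * v[i] * T[i];
    }
}
```

No temporary arrays such as log_temp or volat2_temp are needed in this form, which also reduces memory traffic.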
3) Intel MKL math functions are expected to be threaded for vector length 16K, and this should give you an additional performance benefit. Setting the number of threads with the Intel MKL service functions would probably be useful, as different functions are threaded differently at the same vector length. You might also want to try a different approach and integrate parallelization into your application (this can be done, for example, with OpenMP* directives); in this case, please use the serial version of the Intel MKL math functions.
Also, the Intel MKL manual suggests calling vector functions when the vector length is at least several dozen elements. For small vector lengths, the math functions available in the Intel C++ compiler would be a better choice.
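The application-level parallelization mentioned in point 3 might look roughly like the sketch below: the array is split into chunks, and each OpenMP thread processes whole chunks with a serial kernel. The names exp_chunk and exp_parallel are hypothetical, and exp_chunk is a plain-loop stand-in for where a serial VML call (such as vdExp) would go in a real build.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical stand-in for a serial vector math call on one chunk.
static void exp_chunk(long n, const double* in, double* out) {
    for (long i = 0; i < n; ++i) out[i] = std::exp(in[i]);
}

// Splits [0, n) into chunks of at most `chunk` elements and processes the
// chunks in parallel when the code is compiled with OpenMP enabled.
void exp_parallel(long n, const double* in, double* out, long chunk) {
    const long nchunks = (n + chunk - 1) / chunk;
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (long c = 0; c < nchunks; ++c) {
        const long begin = c * chunk;
        const long len = std::min(chunk, n - begin);
        exp_chunk(len, in + begin, out + begin);
    }
}
```

The chunk size is a tuning parameter; it should stay well above the "several dozen elements" threshold so that each call has enough work.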
4) You have some room to simplify the Black-Scholes formula (even more so if you consider that 2 of the 5 arguments are constant).
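One possible reading of point 4 is sketched below; it is not the simplification from the original thread. Shared subexpressions are computed once, d2 is derived from d1, and the put price comes from put-call parity instead of two more CND evaluations. The names OptionPrices, normal_cdf, and black_scholes are hypothetical, and the normal CDF is computed with std::erfc here, which is a substitute for the benchmark's own approximation. If r and v are the constant arguments (an assumption), the term r + 0.5 * v * v can also be hoisted out of the loop over options.

```cpp
#include <cmath>

struct OptionPrices { double call; double put; };

// Standard normal CDF via the complementary error function.
inline double normal_cdf(double x) {
    return 0.5 * std::erfc(-x / std::sqrt(2.0));
}

// S: spot, X: strike, T: time to expiry, r: risk-free rate, v: volatility.
OptionPrices black_scholes(double S, double X, double T, double r, double v) {
    const double vSqrtT = v * std::sqrt(T);
    const double d1 = (std::log(S / X) + (r + 0.5 * v * v) * T) / vSqrtT;
    const double d2 = d1 - vSqrtT;               // reuse d1 instead of recomputing
    const double discX = X * std::exp(-r * T);   // discounted strike, computed once
    const double call = S * normal_cdf(d1) - discX * normal_cdf(d2);
    const double put = call - S + discX;         // put-call parity
    return {call, put};
}
```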
5) The first call to Intel MKL functions performs additional initialization; that's why the results for 2 passes are better than for 1 pass.
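To keep that one-time initialization out of the measurement, a benchmark can make one untimed warm-up call before starting the clock. The helper below is a minimal sketch of that idea; its name and signature are hypothetical, and `kernel` stands for whatever computation is being timed.

```cpp
#include <chrono>

// Runs `kernel` once without timing it, then returns the average
// wall-clock time in seconds over `passes` timed runs.
template <class F>
double seconds_per_pass_after_warmup(F kernel, int passes) {
    kernel();  // untimed warm-up absorbs first-call initialization
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < passes; ++i) kernel();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count() / passes;
}
```

With this pattern the 1-pass and 2-pass timings should be directly comparable, since neither includes the initialization cost.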
It is also worth noting that using the capabilities of the Intel Compiler (such as vectorization, parallelization, and architecture-specific optimizations) in addition to the features of Intel MKL would open more opportunities for performance gains on multi-core processors.
Please let us know if you have more questions or comments on the optimization approaches to the Black-Scholes benchmark, and we will gladly help.
In addition to Ilya's answer, I'd suggest having a look at the VML & VSL training materials available at http://software.intel.com/en-us/articles/intel-mkl-vmlvsl-training-material/.
This set of slides describes the features of the Vector Math functions and the statistical functionality available in Intel Math Kernel Library. Slides 28-30 describe optimization approaches to the Black-Scholes formula and give related performance data.
Also, at some point in the future we are thinking about posting white papers which, in particular, would demonstrate Intel software-based optimization approaches to Black-Scholes and to the Monte Carlo version of the European option pricing problem. Code samples would be part of those publications.
Please feel free to ask more questions on the Vector Math and Statistical features of Intel MKL, and we will help.
Ilya, I have a similar problem. I see a big difference between sine values computed by vdSin and by sin() inside a loop. I use MS VS 2005 with Intel Composer XE 2011 Update 6. Could you please tell me the compiler option that selects the different accuracies of VML functions?
>>...both using standard C functions...
Your C++ prototype could be improved by using C++ templates. You're duplicating code for the
'float' and 'double' data types. What if some time later you need to do calculations for a 'long double' data type?
Please take a look at a prototype of the Black-Scholes Algorithm with C++ templates:
template < class T > class TBlackScholes
{
public:
	// Method bodies other than Init were not included in the post.
	TBlackScholes( void );
	virtual ~TBlackScholes( void );
	virtual void RunTest( int iNumPasses );
	void Init( void )
	{
		tPI = ( T )3.14159265358979323846;
		tA1 = ( T ) 0.31938153;
		tA2 = ( T )-0.356563782;
		tA3 = ( T ) 1.781477937;
		tA4 = ( T )-1.821255978;
		tA5 = ( T ) 1.330274429;
		tANeeded = ( T )0.3989423;
		tKNeeded = ( T )0.2316419;
	}
	T Compute( char chFlag, T tS, T tX, T tT, T tR, T tV );
	T ComputeCND( T tX );
	// Members assigned in Init (their declarations were not shown in the post)
	T tPI, tA1, tA2, tA3, tA4, tA5, tANeeded, tKNeeded;
};
int main( void )
{
	// Test for 'float' datatype
	TBlackScholes< float > fBS;
	fBS.RunTest( 1 );
	fBS.RunTest( 10 );
	fBS.RunTest( 100 );
	fBS.RunTest( 1000 );
	fBS.RunTest( 10000 );
	// Test for 'double' datatype
	TBlackScholes< double > dBS;
	dBS.RunTest( 1 );
	dBS.RunTest( 10 );
	dBS.RunTest( 100 );
	dBS.RunTest( 1000 );
	dBS.RunTest( 10000 );
	return 0;
}
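The post does not show the body of ComputeCND. The constants set in Init match the well-known five-term polynomial approximation of the cumulative normal distribution (Abramowitz and Stegun), so one plausible way to fill it in is sketched below as a free function template. The name compute_cnd and the local variable names are assumptions; the actual member function may differ.

```cpp
#include <cmath>

// Polynomial approximation of the standard normal CDF, written once for
// float, double, and long double.
template <class T>
T compute_cnd(T x) {
    const T a1 = T(0.31938153);
    const T a2 = T(-0.356563782);
    const T a3 = T(1.781477937);
    const T a4 = T(-1.821255978);
    const T a5 = T(1.330274429);
    const T invSqrt2Pi = T(0.3989423);  // tANeeded in the prototype
    const T k0 = T(0.2316419);          // tKNeeded in the prototype

    const T L = std::fabs(x);
    const T K = T(1) / (T(1) + k0 * L);
    // Horner form of a1*K + a2*K^2 + a3*K^3 + a4*K^4 + a5*K^5
    const T poly = K * (a1 + K * (a2 + K * (a3 + K * (a4 + K * a5))));
    const T w = T(1) - invSqrt2Pi * std::exp(-L * L / T(2)) * poly;
    return (x < T(0)) ? T(1) - w : w;   // use symmetry for negative x
}
```

Because the type is a template parameter, the same source covers the 'long double' case without a third copy of the code.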