<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Topic in Intel® oneAPI Math Kernel Library: MKL optimization problem: VML functions (sequential and threaded)</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-optimization-problem-VML-functions-sequential-and-threaded/m-p/819016#M4567</link>
    <description>MKL optimization problem: VML functions (sequential and threaded)</description>
    <pubDate>Wed, 01 Feb 2012 13:17:29 GMT</pubDate>
    <dc:creator>zeusz4u</dc:creator>
    <dc:date>2012-02-01T13:17:29Z</dc:date>
    <item>
      <title>MKL optimization problem: VML functions (sequential and threaded)</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-optimization-problem-VML-functions-sequential-and-threaded/m-p/819016#M4567</link>
      <description>Hi everyone,&lt;BR /&gt;&lt;BR /&gt;I'm having optimization problems with MKL. I'm not sure whether I'm doing something wrong or whether MKL simply has no benefit in my case.&lt;BR /&gt;&lt;BR /&gt;I've made a prototype implementation of the Black-Scholes algorithm for evaluating option prices, once using standard C functions and once using MKL functions from the VML library. My problem is that the MKL implementation is much slower than the plain float implementation. I've tried both single-threaded and multi-threaded builds. Could someone please take a look and suggest what else I could try? According to the documentation this is a high-performance library, but my results don't reflect that.&lt;BR /&gt;&lt;BR /&gt;I've attached the code. Just uncomment the mkl_domain_set_num_threads() call. The makefile contains both the single-threaded and the multi-threaded libraries; you just have to uncomment the corresponding lines.&lt;BR /&gt;&lt;BR /&gt;With sequential linking:&lt;BR /&gt;&lt;B&gt;icpc -c -w1 -O2 -xsse4.2 -DMKL_ILP64 -I. -I/opt/intel/composerxe/include -I/opt/intel/mkl/include -o Black76.o Black76.cpp&lt;BR /&gt;icpc -c -w1 -O2 -xsse4.2 -DMKL_ILP64 -I. -I/opt/intel/composerxe/include -I/opt/intel/mkl/include -o main.o main.cpp&lt;BR /&gt;icpc -L/opt/intel/mkl/lib/intel64 -L/opt/intel/lib/intel64 Black76.o main.o -o black76_intel -lrt -lmkl_intel_ilp64 -lmkl_sequential -lmkl_core -lm&lt;/B&gt;&lt;BR /&gt;&lt;BR /&gt;I'm getting the following performance results:&lt;BR /&gt;Completed &lt;B&gt;1 passes in 0 : 001526118&lt;/B&gt; seconds&lt;BR /&gt;Completed &lt;B&gt;2 passes in 0 : 000007518&lt;/B&gt; seconds&lt;BR /&gt;Completed &lt;B&gt;3 passes in 0 : 000008536&lt;/B&gt; seconds&lt;BR /&gt;Completed &lt;B&gt;10 passes in 0 : 000026468&lt;/B&gt; seconds&lt;BR /&gt;Completed 100 passes in 0 : 000329301 seconds&lt;BR /&gt;Completed 1000 passes in 0 : 002591126 seconds&lt;BR /&gt;Completed 10000 passes in 0 : 014796280 seconds&lt;BR /&gt;Completed 100000 passes in 0 : 147133308 seconds&lt;BR /&gt;Completed 1000000 passes in 1 : 465677079 seconds&lt;BR /&gt;Completed 10000000 passes in 14 : 714433962 seconds&lt;BR /&gt;&lt;BR /&gt;Something is also odd here: running 2 passes should not be quicker than running only one pass. There is a huge difference between the two, and the 3-pass result doesn't look realistic either. Even 100 passes run quicker than the first one. This shouldn't happen.&lt;BR /&gt;&lt;BR /&gt;When I build with multi-threading I use the following link line:&lt;BR /&gt;&lt;B&gt;icpc -L/opt/intel/mkl/lib/intel64 -L/opt/intel/lib/intel64 Black76.o main.o -o black76_intel -lrt -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm&lt;/B&gt;&lt;BR /&gt;&lt;BR /&gt;I will have to make 16 calculations repeatedly, so I defined ARRAYSIZE=16. I also tried increasing ARRAYSIZE to 16000 and enabling multi-threading, but sequential was still faster than multi-threaded. I'd like to improve performance with 16 calculations.&lt;BR /&gt;&lt;BR /&gt;Can someone help me?&lt;BR /&gt;&lt;BR /&gt;Please advise,&lt;BR /&gt;&lt;BR /&gt;Thank you,&lt;BR /&gt;Eduard.</description>
      <pubDate>Wed, 01 Feb 2012 13:17:29 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-optimization-problem-VML-functions-sequential-and-threaded/m-p/819016#M4567</guid>
      <dc:creator>zeusz4u</dc:creator>
      <dc:date>2012-02-01T13:17:29Z</dc:date>
    </item>
    <item>
      <title>MKL optimization problem: VML functions (sequential and threade</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-optimization-problem-VML-functions-sequential-and-threaded/m-p/819017#M4568</link>
      <description>&lt;P&gt;Hi Eduard,&lt;/P&gt;&lt;P&gt;Thanks for your question, the code,
and the detailed description of your environment.&lt;/P&gt;&lt;P&gt;We have several comments on your code which
may help to improve the performance of your Black-Scholes benchmark:&lt;/P&gt;&lt;P&gt;1) By default, Intel Math Kernel Library runs the
High Accuracy versions of the Vector Math functions, while the compiler default is the Lower Accuracy versions.&lt;/P&gt;

&lt;P&gt;If your application does not require this level
of accuracy, you might want to relax it using vmlSetMode as shown below:&lt;/P&gt;

&lt;P&gt;vmlSetMode(VML_LA); // use Lower Accuracy
version of the functions&lt;/P&gt;

&lt;P&gt;or even&lt;/P&gt;

&lt;P&gt;vmlSetMode(VML_EP); // using Enhanced
Performance version of the functions&lt;/P&gt;

&lt;P&gt;This will give you an additional
performance benefit for the math functions.&lt;/P&gt;&lt;P&gt;To get an idea of the performance of the vector math functions, see the performance data and graphs available at&lt;/P&gt;

&lt;P&gt;&lt;A href="http://software.intel.com/sites/products/documentation/hpc/mkl/vml/functions/_performanceall.html"&gt;http://software.intel.com/sites/products/documentation/hpc/mkl/vml/functions/_performanceall.html&lt;/A&gt;&lt;/P&gt;

&lt;P&gt;&lt;A href="http://software.intel.com/sites/products/documentation/hpc/mkl/vml/functions/exp.html"&gt;http://software.intel.com/sites/products/documentation/hpc/mkl/vml/functions/exp.html&lt;/A&gt;&lt;/P&gt;

&lt;P&gt;and similar pages.&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;

&lt;P&gt;2) Modern processors can execute multiplication
and addition instructions in parallel, and the Intel compiler can take
advantage of that by scheduling the instructions properly.&lt;/P&gt;

&lt;P&gt;So you might want to use a plain loop
instead of the vector Mul, Add, and Sqr calls. For example, please try this loop&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;

&lt;P&gt; for(j=0;j&amp;lt;numPasses;j++)&lt;BR /&gt;        {&lt;BR /&gt;               volat2_temp[j] = volat2_temp[j]*T;&lt;BR /&gt;               Numerator[j] = log_temp[j] + volat2_temp[j];&lt;BR /&gt;        }&lt;/P&gt;

&lt;P&gt;instead of &lt;/P&gt;

&lt;P&gt; vsMul(ARRAYSIZE, volat2_temp, T, volat2_temp);&lt;BR /&gt;        //compute numerator = (log(S / X) + (v * v / 2) * T)&lt;BR /&gt;        vsAdd(ARRAYSIZE, log_temp, volat2_temp, numerator);&lt;/P&gt;
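&lt;P&gt;A minimal sketch of the difference in plain C++, with no MKL calls involved: the first function mimics two separate element-wise passes (a vector Mul followed by a vector Add), the second does the same arithmetic in one fused loop. The function names, the use of std::vector, and treating T as an array are assumptions made for illustration; they are not taken from the attached code.&lt;/P&gt;
<![CDATA[
```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Two separate element-wise passes, mimicking a vector Mul followed by a
// vector Add: the arrays are traversed once per operation.
// volat2_temp is taken by value so the caller's data is left untouched.
// (Hypothetical helper; the names are illustrative only.)
std::vector<float> numerator_two_passes(const std::vector<float>& log_temp,
                                        std::vector<float> volat2_temp,
                                        const std::vector<float>& T) {
    assert(log_temp.size() == volat2_temp.size());
    assert(log_temp.size() == T.size());
    const std::size_t n = log_temp.size();
    std::vector<float> numerator(n);
    for (std::size_t j = 0; j < n; ++j) {
        volat2_temp[j] = volat2_temp[j] * T[j];       // pass 1: Mul
    }
    for (std::size_t j = 0; j < n; ++j) {
        numerator[j] = log_temp[j] + volat2_temp[j];  // pass 2: Add
    }
    return numerator;
}

// The same arithmetic fused into a single loop: one traversal, no
// intermediate store, and the compiler sees the multiply and the add together.
std::vector<float> numerator_fused(const std::vector<float>& log_temp,
                                   const std::vector<float>& volat2_temp,
                                   const std::vector<float>& T) {
    assert(log_temp.size() == volat2_temp.size());
    assert(log_temp.size() == T.size());
    const std::size_t n = log_temp.size();
    std::vector<float> numerator(n);
    for (std::size_t j = 0; j < n; ++j) {
        numerator[j] = log_temp[j] + volat2_temp[j] * T[j];
    }
    return numerator;
}
```
]]>
&lt;P&gt;Whether the fused form is actually faster depends on the compiler, the flags, and the vector length, so it is worth timing both on the target machine.&lt;/P&gt;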

&lt;P&gt;You would also get better performance
if you group as many such simple operations as possible into one loop,
because the compiler will then have better instruction scheduling possibilities.&lt;/P&gt;&lt;P&gt;3) Intel MKL math functions are expected to be
threaded for a vector length of 16K, and this should give you an additional performance
benefit. Setting the number of threads with the Intel MKL service functions would
probably be useful, as different functions are threaded differently for the same
vector length. You might also want to take a different approach and integrate
parallelization into your application (this can be done, for example, by using
OpenMP* directives); in this case, please use the serial version of the Intel MKL
math functions.&lt;/P&gt;

&lt;P&gt;Also, the Intel MKL Manual suggests calling the vector
functions when the vector length is at least several dozen elements. For small
vector lengths, the math functions available in the Intel C++ compiler would be
a better choice.&lt;/P&gt;
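&lt;P&gt;The two pieces of advice in point 3 (thread inside the application, and do not bother for short vectors) might be combined as in the sketch below. It uses the standard OpenMP "if" clause so that short vectors stay on the calling thread. The function name and the threshold value are assumptions for illustration; the right cut-off is platform-dependent and has to be measured.&lt;/P&gt;
<![CDATA[
```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical cut-off below which threading is assumed not to pay off.
const long kParallelThreshold = 16384;

// Element-wise exp with application-level threading. The OpenMP "if" clause
// keeps short vectors (such as 16 elements) on the calling thread, so they
// do not pay the thread start-up cost. When the code is built without OpenMP
// support, the pragma is ignored and the loop simply runs serially.
void exp_all(const std::vector<float>& in, std::vector<float>& out) {
    assert(in.size() == out.size());
    const long n = static_cast<long>(in.size());
    #pragma omp parallel for if(n >= kParallelThreshold)
    for (long j = 0; j < n; ++j) {
        out[j] = std::exp(in[j]);
    }
}
```
]]>
&lt;P&gt;With a 16-element input this runs as an ordinary serial loop; only inputs at or above the threshold are split across threads.&lt;/P&gt;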

&lt;P&gt;4) You have some room for simplifying the Black-Scholes
formula (even more so if you consider that 2 of the 5 arguments are constant).&lt;/P&gt;
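&lt;P&gt;As a rough illustration of point 4, the sketch below prices Black-76 calls with the loop-invariant terms computed once. Which two arguments are constant is an assumption here (a common expiry T and rate R), and the function names and the erfc-based normal CDF are illustrative choices, not the code from the attachment.&lt;/P&gt;
<![CDATA[
```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Standard normal CDF expressed through the complementary error function.
inline double cnd(double x) {
    return 0.5 * std::erfc(-x / std::sqrt(2.0));
}

// Black-76 call prices for arrays S (futures price), X (strike) and
// V (volatility) with a common expiry T and rate R. Treating T and R as the
// two constant arguments is an assumption made for this sketch.
std::vector<double> black76_calls(const std::vector<double>& S,
                                  const std::vector<double>& X,
                                  const std::vector<double>& V,
                                  double T, double R) {
    assert(S.size() == X.size());
    assert(S.size() == V.size());
    // Loop-invariant terms: computed once instead of once per option.
    const double sqrt_T = std::sqrt(T);
    const double discount = std::exp(-R * T);
    std::vector<double> price(S.size());
    for (std::size_t j = 0; j < S.size(); ++j) {
        const double v_sqrt_T = V[j] * sqrt_T;
        // d1 = (ln(S/X) + v*v*T/2) / (v*sqrt(T)), rearranged to reuse v*sqrt(T)
        const double d1 = std::log(S[j] / X[j]) / v_sqrt_T + 0.5 * v_sqrt_T;
        const double d2 = d1 - v_sqrt_T;
        price[j] = discount * (S[j] * cnd(d1) - X[j] * cnd(d2));
    }
    return price;
}
```
]]>
&lt;P&gt;The square root and the exponential of the constant arguments drop out of the per-option work, which matters most when the same 16 options are re-priced many times.&lt;/P&gt;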

&lt;P&gt;5) The first call to an Intel MKL function performs additional
initialization; that's why the result for 2 passes is better than for 1 pass.&lt;/P&gt;
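&lt;P&gt;One common way to keep such one-time costs out of a measurement is to make an untimed warm-up call before starting the clock. The sketch below shows the idea with std::chrono; kernel() is a hypothetical stand-in for the real computation, and the function names are illustrative only.&lt;/P&gt;
<![CDATA[
```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

// Stand-in for the real per-element computation (hypothetical kernel).
void kernel(const std::vector<float>& in, std::vector<float>& out) {
    assert(in.size() == out.size());
    for (std::size_t j = 0; j < in.size(); ++j) {
        out[j] = std::exp(in[j]);
    }
}

// Times num_passes calls of kernel() and returns the elapsed nanoseconds.
// One untimed call is made first, writing into the same output buffer that
// the timed passes use, so that one-time initialization and first-touch
// effects are excluded from the measurement.
long long time_passes(const std::vector<float>& in, std::vector<float>& out,
                      int num_passes) {
    kernel(in, out);  // warm-up pass, not timed
    const auto start = std::chrono::steady_clock::now();
    for (int p = 0; p < num_passes; ++p) {
        kernel(in, out);
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}
```
]]>
&lt;P&gt;If the real application will always hit the cold first call, the unwarmed number is the relevant one instead; the sketch only separates the two cases.&lt;/P&gt;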

&lt;P&gt;It is also worth noting that using the
capabilities of the Intel Compiler (such as vectorization, parallelization, and
architecture-specific optimizations) in addition to the features of Intel MKL
would open up more opportunities for performance gains on multi-core processors.&lt;/P&gt;

&lt;P&gt;Please let us know if you have more
questions or comments on the optimization approaches to the Black-Scholes
benchmark, and we will gladly help.&lt;/P&gt;</description>
      <pubDate>Thu, 02 Feb 2012 06:59:37 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-optimization-problem-VML-functions-sequential-and-threaded/m-p/819017#M4568</guid>
      <dc:creator>Ilya_B_Intel</dc:creator>
      <dc:date>2012-02-02T06:59:37Z</dc:date>
    </item>
    <item>
      <title>MKL optimization problem: VML functions (sequential and threade</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-optimization-problem-VML-functions-sequential-and-threaded/m-p/819018#M4569</link>
      <description>Hello Ilya,&lt;BR /&gt;&lt;BR /&gt;Thank you for the detailed explanation and examples.&lt;BR /&gt;&lt;BR /&gt;1) I will try the different accuracy levels; I'm really curious about the accuracy of the results as well as the execution time.&lt;BR /&gt;&lt;BR /&gt;2) I think at point 2 you meant:&lt;BR /&gt;for(j=0;j&amp;lt; &lt;B&gt;ARRAYLENGTH&lt;/B&gt;;j++)&lt;BR /&gt;        {&lt;BR /&gt;               volat2_temp[j] = volat2_temp[j]*T;&lt;BR /&gt;               Numerator[j] = log_temp[j] + volat2_temp[j];&lt;BR /&gt;        }&lt;BR /&gt;&lt;BR /&gt;This is what I wanted to try next: use only the MKL exp, log, sqrt, and cnd functions, and use regular arithmetic for +, -, *, and /. Maybe Div could also be taken from the MKL library.&lt;BR /&gt;&lt;BR /&gt;Basically I want to measure how long it takes to make those 16 calculations when I do 1 pass, 10 passes, and 100 passes, to see the real performance of the calculation in these cases.&lt;BR /&gt;&lt;BR /&gt;3) Does it make sense to use &lt;B&gt;#pragma omp parallel sections&lt;/B&gt; to indicate a parallel region (4 threads, each working on 4-element arrays)? I had another implementation using the math.h functions. I tried using OpenMP there, but the result wasn't good at all. On the other hand, I've seen examples of using parallel sections for the QuickSort algorithm, so it should be something similar. Maybe I could use &lt;B&gt;#pragma omp parallel for&lt;/B&gt; for the above example of element-by-element multiplications and additions.&lt;BR /&gt;&lt;BR /&gt;Also, does it help to turn hyper-threading off and to use a real-time kernel instead of the regular one? This is what I also want to try next.&lt;BR /&gt;&lt;BR /&gt;One more remark here:&lt;BR /&gt;4) I've used constants in this simulation, but I'm not sure it will be the same in a real-time environment. Actually that's inaccurate, because the calculations should be made for options having the same expiry date, so vector T should be constant. I will check the other parameters as well and try to get some real data for the simulation.&lt;BR /&gt;&lt;BR /&gt;I really appreciate your response, as I'm new to MKL programming and to Intel CPU programming as well. So everything you said is a great help to me.&lt;BR /&gt;&lt;BR /&gt;One more question I'd like to ask (I don't know if this is the right place, but it's worth a try): I've found a presentation on the internet by Heinz Bast, Technical Consulting Engineer, Software Development Products, Intel Corporation, entitled &lt;B&gt;A Case Study: Using Intel Parallel Studio XE to Optimize Black Scholes Calculation&lt;/B&gt;. The PDF mentions that the source code can be obtained freely upon request from the presenter. Can I download it from somewhere? It would be a good reference to see a highly optimized Black-Scholes algorithm.&lt;BR /&gt;&lt;BR /&gt;Thank you, and I'm looking forward to a response to the above questions (or at least some of them).&lt;BR /&gt;&lt;BR /&gt;Eduard.</description>
      <pubDate>Thu, 02 Feb 2012 08:18:08 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-optimization-problem-VML-functions-sequential-and-threaded/m-p/819018#M4569</guid>
      <dc:creator>zeusz4u</dc:creator>
      <dc:date>2012-02-02T08:18:08Z</dc:date>
    </item>
    <item>
      <title>MKL optimization problem: VML functions (sequential and threade</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-optimization-problem-VML-functions-sequential-and-threaded/m-p/819019#M4570</link>
      <description>&lt;DIV&gt;I would say that on most platforms, threading the BS formula with vector length 16 is not really reasonable: the threading overhead will outweigh all the benefits. If you need larger computations it will make sense (and either &lt;B&gt;#pragma omp parallel sections&lt;/B&gt; or &lt;B&gt;#pragma omp parallel for&lt;/B&gt; will work fine). On the MKL side, no VML function is threaded for vector lengths less than 100.&lt;/DIV&gt;&lt;DIV&gt;I saw some visible benefits from having hyper-threading turned on in the Black-Scholes benchmark, though, again, the vector lengths were much larger than 16.&lt;/DIV&gt;&lt;DIV&gt;You can also consider using vsInvSqrt instead of vsSqrt + vsDiv, try to limit the number of divisions (the BS formula with 2 fixed arguments and 3 array arguments can be done with only 1 vsDiv), and consider using vsErf instead of vsCdfNorm (because of some mathematical properties of those functions, it is sometimes quicker to do Erf plus scaling than CdfNorm).&lt;/DIV&gt;</description>
      <pubDate>Thu, 02 Feb 2012 11:49:57 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-optimization-problem-VML-functions-sequential-and-threaded/m-p/819019#M4570</guid>
      <dc:creator>Ilya_B_Intel</dc:creator>
      <dc:date>2012-02-02T11:49:57Z</dc:date>
    </item>
    <item>
      <title>MKL optimization problem: VML functions (sequential and threade</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-optimization-problem-VML-functions-sequential-and-threaded/m-p/819020#M4571</link>
      <description>&lt;P&gt;Hi Eduard,&lt;BR /&gt;&lt;BR /&gt;In addition to Ilya's answer, I'd suggest having a look at the VML &amp;amp; VSL training materials available at &lt;A href="http://software.intel.com/en-us/articles/intel-mkl-vmlvsl-training-material/" target="_blank"&gt;http://software.intel.com/en-us/articles/intel-mkl-vmlvsl-training-material/&lt;/A&gt;.&lt;BR /&gt;This set of slides describes the features of the Vector Math functions and the statistical functionality available in Intel Math Kernel Library. Slides 28-30 describe optimization approaches to the Black-Scholes formula and give related performance data.&lt;BR /&gt;Also, at some point in the future we are thinking about posting white papers which, in particular, would demonstrate Intel software-based optimization approaches to Black-Scholes and to the Monte Carlo version of the European option pricing problem. Code samples would be part of those publications.&lt;BR /&gt;&lt;BR /&gt;Please feel free to ask more questions about the Vector Math and Stat features of Intel MKL, and we will help.&lt;BR /&gt;&lt;BR /&gt;Thanks,&lt;BR /&gt;Andrey&lt;/P&gt;</description>
      <pubDate>Fri, 03 Feb 2012 07:31:30 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-optimization-problem-VML-functions-sequential-and-threaded/m-p/819020#M4571</guid>
      <dc:creator>Andrey_N_Intel</dc:creator>
      <dc:date>2012-02-03T07:31:30Z</dc:date>
    </item>
    <item>
      <title>MKL optimization problem: VML functions (sequential and threade</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-optimization-problem-VML-functions-sequential-and-threaded/m-p/819021#M4572</link>
      <description>&lt;P&gt;Ilya, I have a similar problem. I see a big difference between sine computed by vdSin and by sin() inside a loop. I use MS VS 2005 with Intel Composer XE 2011 Update 6. Could you please tell me the compiler switch for the different accuracy levels of the VML functions?&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Dmitry.&lt;/P&gt;</description>
      <pubDate>Fri, 03 Feb 2012 09:26:53 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-optimization-problem-VML-functions-sequential-and-threaded/m-p/819021#M4572</guid>
      <dc:creator>dmitry_k</dc:creator>
      <dc:date>2012-02-03T09:26:53Z</dc:date>
    </item>
    <item>
      <title>MKL optimization problem: VML functions (sequential and threade</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-optimization-problem-VML-functions-sequential-and-threaded/m-p/819022#M4573</link>
      <description>&lt;DIV&gt;Dmitry,&lt;/DIV&gt;&lt;BR /&gt;On the MKL side, you can control accuracy with a special function call:&lt;DIV&gt;vmlSetMode([VML_HA|VML_LA|VML_EP])&lt;/DIV&gt;&lt;BR /&gt;&lt;DIV&gt;On the Intel Compiler vectorized math functions side, you can control accuracy with the switch:&lt;/DIV&gt;&lt;DIV&gt;-fimf-precision=[high|medium|low]&lt;/DIV&gt;&lt;BR /&gt;&lt;DIV&gt;See the compiler documentation for more details:&lt;/DIV&gt;&lt;DIV&gt;&lt;A href="http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/cpp/lin/copts/common_options/option_fimf_precision.htm"&gt;http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/cpp/lin/copts/common_options/option_fimf_precision.htm&lt;/A&gt;&lt;/DIV&gt;</description>
      <pubDate>Fri, 03 Feb 2012 11:06:50 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-optimization-problem-VML-functions-sequential-and-threaded/m-p/819022#M4573</guid>
      <dc:creator>Ilya_B_Intel</dc:creator>
      <dc:date>2012-02-03T11:06:50Z</dc:date>
    </item>
    <item>
      <title>MKL optimization problem: VML functions (sequential and threade</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-optimization-problem-VML-functions-sequential-and-threaded/m-p/819023#M4574</link>
      <description>Ilya, thanks a lot.</description>
      <pubDate>Sat, 04 Feb 2012 13:16:05 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-optimization-problem-VML-functions-sequential-and-threaded/m-p/819023#M4574</guid>
      <dc:creator>dmitry_k</dc:creator>
      <dc:date>2012-02-04T13:16:05Z</dc:date>
    </item>
    <item>
      <title>MKL optimization problem: VML functions (sequential and threade</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-optimization-problem-VML-functions-sequential-and-threaded/m-p/819024#M4575</link>
      <description>Hello Ilya and Andrey,&lt;BR /&gt;&lt;BR /&gt;Thank you for the detailed instructions. Using Erf instead of CdfNorm, replacing Add, Sub, Mul, and Div with for loops, and setting the accuracy to LA considerably improved the program's execution time.&lt;BR /&gt;&lt;BR /&gt;You have been a great help.&lt;BR /&gt;&lt;BR /&gt;I'm now looking at the ArBB implementation; there is a Black-Scholes example included with the installation kit. I hope this one will be even better than MKL.&lt;BR /&gt;&lt;BR /&gt;Eduard.</description>
      <pubDate>Mon, 06 Feb 2012 13:28:03 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-optimization-problem-VML-functions-sequential-and-threaded/m-p/819024#M4575</guid>
      <dc:creator>zeusz4u</dc:creator>
      <dc:date>2012-02-06T13:28:03Z</dc:date>
    </item>
    <item>
      <title>MKL optimization problem: VML functions (sequential and threade</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-optimization-problem-VML-functions-sequential-and-threaded/m-p/819025#M4576</link>
      <description>I have another question/concern with which you may be able to help me.&lt;BR /&gt;It's still Black-Scholes related, but not MKL.&lt;BR /&gt;&lt;BR /&gt;I tried a different approach to the problem, incorporating both float and double tests.&lt;BR /&gt;&lt;BR /&gt;Please check my code whenever you have some free time. In this version I'm still getting some odd results.&lt;BR /&gt;&lt;BR /&gt;I'm now using SPAN data samples from the Chicago Mercantile Exchange, but I seem to have the same problem with execution times. Which one is to be trusted? Here is the output:&lt;BR /&gt;&lt;BR /&gt;------------ Running Black76 Software benchmark ------------&lt;BR /&gt;RUNNING FLOAT TEST&lt;BR /&gt;Completed 1 passes in 0 : &lt;B&gt;000009731 &lt;/B&gt;seconds&lt;BR /&gt;Completed 2 passes in 0 : 000001850 seconds&lt;BR /&gt;Completed 3 passes in 0 : 000002363 seconds&lt;BR /&gt;Completed 4 passes in 0 : 000003015 seconds&lt;BR /&gt;Completed 5 passes in 0 : 000003615 seconds&lt;BR /&gt;Completed 10 passes in 0 : 000006321 seconds&lt;BR /&gt;Completed 20 passes in 0 : 000013203 seconds&lt;BR /&gt;Completed 50 passes in 0 : 000029405 seconds&lt;BR /&gt;Completed 100 passes in 0 : 000058387 seconds&lt;BR /&gt;Completed 1000 passes in 0 : 000595579 seconds&lt;BR /&gt;Completed 10000 passes in 0 : 005931292 seconds&lt;BR /&gt;Completed 100000 passes in 0 : 042001457 seconds&lt;BR /&gt;RUNNING DOUBLE TEST&lt;BR /&gt;Completed 1 passes in 0 : &lt;B&gt;000012223 &lt;/B&gt;seconds&lt;BR /&gt;Completed 2 passes in 0 : 000004050 seconds&lt;BR /&gt;Completed 3 passes in 0 : 000005351 seconds&lt;BR /&gt;Completed 4 passes in 0 : 000006255 seconds&lt;BR /&gt;Completed 5 passes in 0 : 000007663 seconds&lt;BR /&gt;Completed 10 passes in 0 : 000014439 seconds&lt;BR /&gt;Completed 20 passes in 0 : 000027367 seconds&lt;BR /&gt;Completed 50 passes in 0 : 000067247 seconds&lt;BR /&gt;Completed 100 passes in 0 : 000133634 seconds&lt;BR /&gt;Completed 1000 passes in 0 : 001369065 seconds&lt;BR /&gt;Completed 10000 passes in 0 : 013721015 seconds&lt;BR /&gt;Completed 100000 passes in 0 : 086685354 seconds&lt;BR /&gt;&lt;BR /&gt;My concern is the first pass, where again I'm getting a much higher execution time than later on, and I'm not using any MKL functions this time. Does the Intel compiler still need to do some initialization on the first call of the math.h functions? Or is it related to the fact that the sample data is the same in later passes? Can we trust these results? Please advise. I attached both the code and the Makefile. The program writes the results into a .csv file and also displays the execution time in the console.&lt;BR /&gt;&lt;BR /&gt;I tried both the -O2 and -O3 compiler options; the results are pretty much the same.</description>
      <pubDate>Thu, 09 Feb 2012 19:35:27 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-optimization-problem-VML-functions-sequential-and-threaded/m-p/819025#M4576</guid>
      <dc:creator>zeusz4u</dc:creator>
      <dc:date>2012-02-09T19:35:27Z</dc:date>
    </item>
    <item>
      <title>MKL optimization problem: VML functions (sequential and threade</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-optimization-problem-VML-functions-sequential-and-threaded/m-p/819026#M4577</link>
      <description>Eduard,&lt;DIV&gt;You are looking at the effect of a cold cache.&lt;/DIV&gt;&lt;DIV&gt;I see that you are using different output arrays in your "warming" run and your "real" run.&lt;/DIV&gt;&lt;DIV&gt;float &lt;B&gt;test&lt;/B&gt;[ARRAYSIZE];&lt;/DIV&gt;&lt;DIV&gt;...&lt;/DIV&gt;&lt;DIV&gt;&lt;B&gt;test&lt;/B&gt;[j] = compute_Black76_float('C', S[j], X[j], T[j], R[j], V[j]);&lt;/DIV&gt;&lt;DIV&gt;float &lt;B&gt;result&lt;/B&gt;[ARRAYSIZE];&lt;/DIV&gt;&lt;DIV&gt;...&lt;/DIV&gt;&lt;DIV&gt;&lt;B&gt;result&lt;/B&gt;[j] = compute_Black76_float('C', S[j], X[j], T[j], R[j], V[j]);&lt;/DIV&gt;&lt;DIV&gt;The consequence is this: during the first function call after the warming run, your input arrays are in cache, but your real output array is not in cache yet.&lt;/DIV&gt;&lt;DIV&gt;The next time you run the same function (no matter how many passes are requested), the &lt;B&gt;result&lt;/B&gt; array is located at exactly the same place on the stack, and that turns out to be in cache already.&lt;/DIV&gt;&lt;DIV&gt;Which performance result is more relevant in your case depends on the usage model of the final application (whether the result and input arrays will be in cache before the function call or not). It is definitely worth trying to keep them in cache.&lt;/DIV&gt;&lt;DIV&gt;Ilya&lt;/DIV&gt;</description>
      <pubDate>Fri, 10 Feb 2012 08:46:12 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-optimization-problem-VML-functions-sequential-and-threaded/m-p/819026#M4577</guid>
      <dc:creator>Ilya_B_Intel</dc:creator>
      <dc:date>2012-02-10T08:46:12Z</dc:date>
    </item>
    <item>
      <title>MKL optimization problem: VML functions (sequential and threade</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-optimization-problem-VML-functions-sequential-and-threaded/m-p/819027#M4578</link>
      <description>Thank you Ilya.&lt;BR /&gt;&lt;BR /&gt;So no matter what input data I use (even if I use a different set of input data in each run), should I use the result vector in the warming run as well?&lt;BR /&gt;&lt;BR /&gt;Can I just initialize each element with the value 0, for example?&lt;BR /&gt;&lt;BR /&gt;I made the warming run with a different vector because I thought the compute_Black76_float() function might need some initialization, so that it would be copied onto the stack and remain there throughout the execution of the program, but it seems I was wrong. So, to get a realistic measurement, I understand that the warming run should use the result vector, or at least the vector needs to be initialized with some value to keep it on the stack before the real-time measurements begin.</description>
      <pubDate>Fri, 10 Feb 2012 13:41:14 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-optimization-problem-VML-functions-sequential-and-threaded/m-p/819027#M4578</guid>
      <dc:creator>zeusz4u</dc:creator>
      <dc:date>2012-02-10T13:41:14Z</dc:date>
    </item>
    <item>
      <title>MKL optimization problem: VML functions (sequential and threade</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-optimization-problem-VML-functions-sequential-and-threaded/m-p/819028#M4579</link>
      <description>&lt;P&gt;&lt;STRONG&gt;&amp;gt;&amp;gt;...both using standard C functions...&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;Your &lt;STRONG&gt;C++&lt;/STRONG&gt; prototype could be improved by using &lt;STRONG&gt;C++ templates&lt;/STRONG&gt;. You're duplicating code for the&lt;BR /&gt;'float' and 'double' data types. What if some time later you need to do calculations for a 'long double' data type?&lt;BR /&gt;&lt;BR /&gt;Please take a look at a prototype of the &lt;STRONG&gt;Black-Scholes&lt;/STRONG&gt; algorithm with &lt;STRONG&gt;C++ templates&lt;/STRONG&gt;:&lt;BR /&gt;&lt;BR /&gt;...&lt;BR /&gt;template &amp;lt; class T &amp;gt; class &lt;STRONG&gt;TBlackScholes&lt;/STRONG&gt;&lt;BR /&gt;{&lt;BR /&gt;public:&lt;BR /&gt; &lt;STRONG&gt;TBlackScholes&lt;/STRONG&gt;( void )&lt;BR /&gt; {&lt;BR /&gt; Init();&lt;BR /&gt; };&lt;BR /&gt; virtual ~&lt;STRONG&gt;TBlackScholes&lt;/STRONG&gt;( void )&lt;BR /&gt; {&lt;BR /&gt; };&lt;/P&gt;&lt;P&gt; virtual void &lt;STRONG&gt;RunTest&lt;/STRONG&gt;( int iNumPasses )&lt;BR /&gt; {&lt;BR /&gt; //...&lt;BR /&gt; };&lt;/P&gt;&lt;P&gt;private:&lt;BR /&gt; void &lt;STRONG&gt;Init&lt;/STRONG&gt;( void )&lt;BR /&gt; {&lt;BR /&gt; tPI = ( T )3.14159265358979323846;&lt;/P&gt;&lt;P&gt; tA1 = ( T ) 0.31938153;&lt;BR /&gt; tA2 = ( T )-0.356563782;&lt;BR /&gt; tA3 = ( T ) 1.781477937;&lt;BR /&gt; tA4 = ( T )-1.821255978;&lt;BR /&gt; tA5 = ( T ) 1.330274429;&lt;BR /&gt; tANeeded = ( T )0.3989423;&lt;BR /&gt; tKNeeded = ( T )0.2316419;&lt;BR /&gt; };&lt;/P&gt;&lt;P&gt; T &lt;STRONG&gt;Compute&lt;/STRONG&gt;( char chFlag, T tS, T tX, T tT, T tR, T tV )&lt;BR /&gt; {&lt;BR /&gt; //...&lt;BR /&gt; };&lt;/P&gt;&lt;P&gt; T &lt;STRONG&gt;ComputeCND&lt;/STRONG&gt;( T tX )&lt;BR /&gt; {&lt;BR /&gt; //...&lt;BR /&gt; };&lt;/P&gt;&lt;P&gt;private:&lt;BR /&gt; T tPI;&lt;/P&gt;&lt;P&gt; T tA1;&lt;BR /&gt; T tA2;&lt;BR /&gt; T tA3;&lt;BR /&gt; T tA4;&lt;BR /&gt; T tA5;&lt;BR /&gt; T tANeeded;&lt;BR /&gt; T tKNeeded;&lt;BR /&gt;};&lt;BR /&gt;...&lt;/P&gt;&lt;P&gt;int &lt;STRONG&gt;main&lt;/STRONG&gt;( void )&lt;BR /&gt;{&lt;BR /&gt; ...&lt;BR /&gt; // Test for '&lt;STRONG&gt;float&lt;/STRONG&gt;' data type&lt;BR /&gt; &lt;STRONG&gt;TBlackScholes&lt;/STRONG&gt;&amp;lt; &lt;SPAN style="text-decoration: underline;"&gt;float&lt;/SPAN&gt; &amp;gt; fBS;&lt;/P&gt;&lt;P&gt; fBS.RunTest(  1 );&lt;BR /&gt; fBS.RunTest(  10 );&lt;BR /&gt; fBS.RunTest(  100 );&lt;BR /&gt; fBS.RunTest(  1000 );&lt;BR /&gt; fBS.RunTest( 10000 );&lt;/P&gt;&lt;P&gt; // Test for '&lt;STRONG&gt;double&lt;/STRONG&gt;' data type&lt;BR /&gt; &lt;STRONG&gt;TBlackScholes&lt;/STRONG&gt;&amp;lt; &lt;SPAN style="text-decoration: underline;"&gt;double&lt;/SPAN&gt; &amp;gt; dBS;&lt;/P&gt;&lt;P&gt; dBS.RunTest(  1 );&lt;BR /&gt; dBS.RunTest(  10 );&lt;BR /&gt; dBS.RunTest(  100 );&lt;BR /&gt; dBS.RunTest( 1000 );&lt;BR /&gt; dBS.RunTest( 10000 );&lt;BR /&gt; ...&lt;BR /&gt;}&lt;BR /&gt;&lt;BR /&gt;Best regards,&lt;BR /&gt;Sergey&lt;/P&gt;</description>
      <pubDate>Mon, 13 Feb 2012 16:47:55 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-optimization-problem-VML-functions-sequential-and-threaded/m-p/819028#M4579</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2012-02-13T16:47:55Z</dc:date>
    </item>
  </channel>
</rss>

