<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Scaling on multi-core Xeon CPU  in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Scaling-on-multi-core-Xeon-CPU/m-p/852101#M6629</link>
    <description>&lt;DIV style="margin: 0px; height: auto;"&gt;&lt;/DIV&gt;
If you link with libiompprof5 (/Qopenmp_profile, if using ICL or IFORT), a performance summary of each threaded region should be written to guide.gvs. This will show the balance of work and barrier times among threads. Compare the results under various settings of the KMP_AFFINITY environment variable, e.g. SET KMP_AFFINITY=compact,0,verbose. Check the echoed verbose output to see that the core numbering has been understood.&lt;BR /&gt;For 2, 4, and 6 threads, also try splitting the threads 50-50 between the two processors, but not alternating.&lt;BR /&gt;For a GUI display, guide.gvs may be imported into VTune, or Thread Profiler could be used.&lt;BR /&gt;VTune or PTU event sampling should give you more detail, assuming there is a cache capacity limitation. The reduced cache size and memory bus capacity of the 5506, compared with full-featured models, may become a handicap at some problem size in ?gemm. In that case, it would be interesting to compare the results when you use ifort to compile your ?gemm from public source with debug symbols enabled.&lt;BR /&gt;</description>
    <pubDate>Wed, 30 Dec 2009 17:33:30 GMT</pubDate>
    <dc:creator>TimP</dc:creator>
    <dc:date>2009-12-30T17:33:30Z</dc:date>
    <item>
      <title>Scaling on multi-core Xeon CPU</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Scaling-on-multi-core-Xeon-CPU/m-p/852100#M6628</link>
      <description>I am testing on a dual quad-core Xeon 5506 system (64-bit MKL 10.2, Windows XP64). Profiling the code in sequential mode shows that a very significant amount of time is spent in MKL, inside *gemm3, so I was expecting some decent scaling as I increased OMP_NUM_THREADS from 1 to 8.&lt;BR /&gt;&lt;BR /&gt;Number of threads	Elapsed Time	Process Time&lt;BR /&gt;1	820.577	819.703&lt;BR /&gt;2	596.265	1035.69&lt;BR /&gt;3	527.077	1350.08&lt;BR /&gt;4	491.907	1640.97&lt;BR /&gt;5	475.305	1856.08&lt;BR /&gt;6	460.596	2097.59&lt;BR /&gt;7	454.632	2312.84&lt;BR /&gt;8	449.244	2623.67&lt;BR /&gt;&lt;BR /&gt;I just don't really understand this. It appears MKL is keeping all 8 cores 'busy', but they must be just spinning their wheels. Are there any Intel tools that will help me figure out what is going on here?&lt;BR /&gt;&lt;BR /&gt;Andrew&lt;BR /&gt;</description>
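One way to read the timings above is to turn them into speedup and parallel efficiency. The following is a minimal sketch, not part of the original post: the numbers are copied from the table, both columns are assumed to use the same time unit, and the function name `scaling_summary` is hypothetical.

```python
# Timings copied from the table in the post: threads -> (elapsed, process).
# Assumption: both columns use the same time unit (the post does not name it).
TIMINGS = {
    1: (820.577, 819.703),
    2: (596.265, 1035.69),
    3: (527.077, 1350.08),
    4: (491.907, 1640.97),
    5: (475.305, 1856.08),
    6: (460.596, 2097.59),
    7: (454.632, 2312.84),
    8: (449.244, 2623.67),
}


def scaling_summary(timings):
    """Return rows of (threads, speedup, efficiency, cpu_growth).

    speedup    = elapsed(1) / elapsed(n)
    efficiency = speedup / n
    cpu_growth = process(n) / process(1), i.e. how much total CPU time grew
    """
    base_elapsed, base_process = timings[1]
    rows = []
    for threads in sorted(timings):
        elapsed, process = timings[threads]
        speedup = base_elapsed / elapsed
        rows.append((threads, speedup, speedup / threads, process / base_process))
    return rows


if __name__ == "__main__":
    print("threads  speedup  efficiency  cpu_growth")
    for threads, speedup, efficiency, cpu_growth in scaling_summary(TIMINGS):
        print(f"{threads:7d}  {speedup:7.2f}  {efficiency:10.2f}  {cpu_growth:10.2f}")
```

With these numbers the 8-thread run is about 1.83 times faster than the 1-thread run while consuming about 3.2 times the CPU time, which is consistent with threads waiting rather than computing.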
      <pubDate>Wed, 30 Dec 2009 16:34:45 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Scaling-on-multi-core-Xeon-CPU/m-p/852100#M6628</guid>
      <dc:creator>AndrewC</dc:creator>
      <dc:date>2009-12-30T16:34:45Z</dc:date>
    </item>
    <item>
      <title>Re: Scaling on multi-core Xeon CPU</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Scaling-on-multi-core-Xeon-CPU/m-p/852101#M6629</link>
      <description>&lt;DIV style="margin: 0px; height: auto;"&gt;&lt;/DIV&gt;
If you link with libiompprof5 (/Qopenmp_profile, if using ICL or IFORT), a performance summary of each threaded region should be written to guide.gvs. This will show the balance of work and barrier times among threads. Compare the results under various settings of the KMP_AFFINITY environment variable, e.g. SET KMP_AFFINITY=compact,0,verbose. Check the echoed verbose output to see that the core numbering has been understood.&lt;BR /&gt;For 2, 4, and 6 threads, also try splitting the threads 50-50 between the two processors, but not alternating.&lt;BR /&gt;For a GUI display, guide.gvs may be imported into VTune, or Thread Profiler could be used.&lt;BR /&gt;VTune or PTU event sampling should give you more detail, assuming there is a cache capacity limitation. The reduced cache size and memory bus capacity of the 5506, compared with full-featured models, may become a handicap at some problem size in ?gemm. In that case, it would be interesting to compare the results when you use ifort to compile your ?gemm from public source with debug symbols enabled.&lt;BR /&gt;</description>
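The comparison suggested above can be scripted. The sketch below is an illustration, not something from the thread: `build_runs` and `time_run` are hypothetical helper names, the thread counts are an arbitrary choice, and `verbose,scatter` is an extra setting assumed here for contrast with the `compact,0,verbose` example from the post.

```python
import os
import subprocess
import time

# "compact,0,verbose" is the example from the post; "verbose,scatter" is an
# assumed second setting, added only to have something to compare against.
AFFINITY_SETTINGS = ["compact,0,verbose", "verbose,scatter"]
THREAD_COUNTS = [1, 2, 4, 6, 8]


def build_runs(affinities=AFFINITY_SETTINGS, thread_counts=THREAD_COUNTS):
    """Return one environment-override dict per (affinity, thread count) pair."""
    return [
        {"KMP_AFFINITY": affinity, "OMP_NUM_THREADS": str(threads)}
        for affinity in affinities
        for threads in thread_counts
    ]


def time_run(command, overrides):
    """Run command with overrides applied on top of the current environment
    and return the wall-clock time in seconds."""
    env = dict(os.environ)
    env.update(overrides)
    start = time.perf_counter()
    subprocess.run(command, env=env, check=True)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Only list the planned runs here; nothing is executed.
    for overrides in build_runs():
        print(overrides)
```

Passing each dictionary to `time_run` together with the command line of your own benchmark gives one wall-clock figure per setting.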
      <pubDate>Wed, 30 Dec 2009 17:33:30 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Scaling-on-multi-core-Xeon-CPU/m-p/852101#M6629</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2009-12-30T17:33:30Z</dc:date>
    </item>
    <item>
      <title>Re: Scaling on multi-core Xeon CPU</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Scaling-on-multi-core-Xeon-CPU/m-p/852102#M6630</link>
      <description>
&lt;BR /&gt;Regarding scaling, you may try varying the KMP_BLOCKTIME variable: increasing it makes OpenMP threads react faster, while a smaller block time may improve performance for non-OpenMP threaded code between parallel regions.&lt;BR /&gt;&lt;BR /&gt;But in many cases on FSB-based systems, the application's bandwidth appetite prevents scaling. &lt;BR /&gt;Is your Xeon 5506 FSB based? &lt;BR /&gt;PTU is able to pinpoint the bandwidth problem on FSB-based systems (with the Core2 Bandwidth profile configuration), as well as show the run-time balance between threads and other micro-architectural issues. &lt;BR /&gt;&lt;BR /&gt;On the other hand, I doubt that gemm is what stops the scaling so quickly (at 3-4 threads), as it does the multiplication by blocks.&lt;BR /&gt;&lt;BR /&gt;BTW, what are the matrix sizes in your test?&lt;BR /&gt;&lt;BR /&gt;--Gennady&lt;BR /&gt;&lt;BR /&gt;</description>
      <pubDate>Wed, 30 Dec 2009 19:35:28 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Scaling-on-multi-core-Xeon-CPU/m-p/852102#M6630</guid>
      <dc:creator>Gennady_F_Intel</dc:creator>
      <dc:date>2009-12-30T19:35:28Z</dc:date>
    </item>
    <item>
      <title>Re: Scaling on multi-core Xeon CPU</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Scaling-on-multi-core-Xeon-CPU/m-p/852103#M6631</link>
      <description>&lt;DIV style="margin:0px;"&gt;&lt;/DIV&gt;
The Xeon 5506 is a Nehalem-architecture CPU with reduced QPI performance, half the normal cache size, and no HyperThreading or Turbo mode. If the MKL blocking is effective for the smaller cache, and KMP_AFFINITY=compact is set, I would think that could compensate for the slower QPI. Does MKL assume that all Nehalem platforms have the larger cache?&lt;BR /&gt;I spent some time trying to understand Gennady's comment about KMP_BLOCKTIME. I think he means that small matrices, with relatively short parallel execution times and significant serial execution times in between, could benefit from a reduced KMP_BLOCKTIME. But I suppose the matrix has to be fairly large to use 8 threads, as your timings indicate may be happening. I assume the OP would tell us if OMP_NESTED were in use.&lt;BR /&gt;As Gennady said, it's not possible to make relevant comments without knowing whether the question is about many small matrix operations or ones large enough that cache size becomes an issue.&lt;BR /&gt;</description>
      <pubDate>Thu, 31 Dec 2009 14:28:23 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Scaling-on-multi-core-Xeon-CPU/m-p/852103#M6631</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2009-12-31T14:28:23Z</dc:date>
    </item>
    <item>
      <title>Re: Scaling on multi-core Xeon CPU</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Scaling-on-multi-core-Xeon-CPU/m-p/852104#M6632</link>
      <description>
The core matrix multiplications, in this case, are on complex matrices 394x768 * 768x768. My next step will be to extract this core matrix operation into a simple test case I can share that calls zgemm3, see how it scales compared with when it is buried in my application code, and report back.&lt;BR /&gt;&lt;BR /&gt;</description>
      <pubDate>Mon, 04 Jan 2010 15:57:24 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Scaling-on-multi-core-Xeon-CPU/m-p/852104#M6632</guid>
      <dc:creator>AndrewC</dc:creator>
      <dc:date>2010-01-04T15:57:24Z</dc:date>
    </item>
    <item>
      <title>Re: Scaling on multi-core Xeon CPU</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Scaling-on-multi-core-Xeon-CPU/m-p/852105#M6633</link>
      <description>
I did some tests with sample code calling zgemm3, and it looks like I need to examine my code more closely for thread-scaling bottlenecks, as zgemm3 itself scales fairly well as the matrices get larger.&lt;BR /&gt;&lt;BR /&gt;The tables show the number of threads against the speedup, i.e. the 1-thread wall-clock time divided by the wall-clock time at that thread count. N is the dimension of a square DComplex matrix.&lt;BR /&gt;&lt;BR /&gt;For N=16&lt;BR /&gt;1	1&lt;BR /&gt;2	1.382634633&lt;BR /&gt;3	1.725415255&lt;BR /&gt;4	2.661113465&lt;BR /&gt;5	2.048479939&lt;BR /&gt;6	1.620603122&lt;BR /&gt;7	1.791383007&lt;BR /&gt;8	1.979928082&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;For N=256&lt;BR /&gt;1	1&lt;BR /&gt;2	1.921579502&lt;BR /&gt;3	2.478885332&lt;BR /&gt;4	2.819896679&lt;BR /&gt;5	3.198962239&lt;BR /&gt;6	3.154567957&lt;BR /&gt;7	3.364675664&lt;BR /&gt;8	2.935450255&lt;BR /&gt;&lt;BR /&gt;For N=1024&lt;BR /&gt;1	1&lt;BR /&gt;2	1.948073886&lt;BR /&gt;3	2.790286666&lt;BR /&gt;4	3.537633931&lt;BR /&gt;5	4.244896508&lt;BR /&gt;6	4.76614606&lt;BR /&gt;7	5.264932759&lt;BR /&gt;8	5.429620953&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;</description>
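The tables above can be condensed into a single figure per matrix size. The sketch below is an editorial illustration under the assumption that the posted values are speedups (1-thread time divided by n-thread time); it applies the Karp-Flatt metric, e = (1/S - 1/p) / (1 - 1/p), to estimate the fraction of the work that behaves as if it were serial. The name `serial_fraction` is hypothetical.

```python
# Speedups copied from the post: matrix dimension N -> {threads: speedup}.
SPEEDUPS = {
    16: {1: 1.0, 2: 1.382634633, 3: 1.725415255, 4: 2.661113465,
         5: 2.048479939, 6: 1.620603122, 7: 1.791383007, 8: 1.979928082},
    256: {1: 1.0, 2: 1.921579502, 3: 2.478885332, 4: 2.819896679,
          5: 3.198962239, 6: 3.154567957, 7: 3.364675664, 8: 2.935450255},
    1024: {1: 1.0, 2: 1.948073886, 3: 2.790286666, 4: 3.537633931,
           5: 4.244896508, 6: 4.76614606, 7: 5.264932759, 8: 5.429620953},
}


def serial_fraction(speedup, threads):
    """Karp-Flatt metric: e = (1/S - 1/p) / (1 - 1/p), defined for p > 1."""
    if not threads > 1:
        raise ValueError("the metric needs more than one thread")
    return (1.0 / speedup - 1.0 / threads) / (1.0 - 1.0 / threads)


if __name__ == "__main__":
    print("   N  threads  speedup  efficiency  serial_fraction")
    for n in sorted(SPEEDUPS):
        threads = 8
        speedup = SPEEDUPS[n][threads]
        print(f"{n:4d}  {threads:7d}  {speedup:7.2f}  "
              f"{speedup / threads:10.2f}  {serial_fraction(speedup, threads):15.3f}")
```

A value near 0 means the run behaves as almost fully parallel, and a value near 1 means adding threads barely helps. The same function can be applied to the application-level timings to compare them with the isolated kernel.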
      <pubDate>Mon, 04 Jan 2010 18:59:25 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Scaling-on-multi-core-Xeon-CPU/m-p/852105#M6633</guid>
      <dc:creator>AndrewC</dc:creator>
      <dc:date>2010-01-04T18:59:25Z</dc:date>
    </item>
    <item>
      <title>Re: Scaling on multi-core Xeon CPU</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Scaling-on-multi-core-Xeon-CPU/m-p/852106#M6634</link>
      <description>&lt;DIV style="margin:0px;"&gt;
&lt;DIV id="quote_reply" style="width: 100%; margin-top: 5px;"&gt;
&lt;DIV style="margin-left:2px;margin-right:2px;"&gt;Quoting - &lt;A href="https://community.intel.com/en-us/profile/367365"&gt;tim18&lt;/A&gt;&lt;/DIV&gt;
&lt;DIV style="background-color:#E5E5E5; padding:5px;border: 1px; border-style: inset;margin-left:2px;margin-right:2px;"&gt;&lt;EM&gt; If you link with libiompprof5 (/Qopenmp_profile, if using ICL or IFORT), a performance summary of each threaded region should be written in guide.gvs. This will show the balance of work and barrier times among threads. Compare with various appropriate settings of KMP_AFFINITY environment variable, e.g. SET KMP_AFFINITY=compact,0,verbose. Check the echo to see that the core numbering has been understood.&lt;BR /&gt;For 2,4,6 threads try also with the threads split 50-50 between processors, but not alternating.&lt;BR /&gt;For a GUI display, the guide.gvs may be imported into VTune, or Thread Profiler could be used.&lt;BR /&gt;VTune or PTU event sampling should enable you to get more detail, assuming there is cache capacity limitation. The reduced cache size and memory bus capacity of the 5506, compared with full featured models, may become a handicap at some problem size in ?gemm. Then, it would be interesting to compare the results when you use ifort to compile your ?gemm from public source with debug symbols enabled.&lt;BR /&gt;&lt;/EM&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;BR /&gt;Hi Tim,&lt;BR /&gt;This does bring up the problem Intel has with an annoying and confusing proliferation of products all related to threading(*). The annoying part is that I am paying an "arm and a leg" for C++ Professional and Fortran Professional, yet I do not get access to VTune or Thread Profiler! Yet I get IPP, which is probably less useful to most people.&lt;BR /&gt;&lt;BR /&gt;Partial list....&lt;BR /&gt;Compilers with OpenMP&lt;BR /&gt;Intel TBB&lt;BR /&gt;VTune&lt;BR /&gt;Parallel Studio&lt;BR /&gt;Thread Profiler&lt;BR /&gt;Intel IPP&lt;BR /&gt;Intel MKL&lt;BR /&gt;Intel Parallel Amplifier?&lt;BR /&gt;....&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;Andrew&lt;BR /&gt;</description>
      <pubDate>Thu, 07 Jan 2010 23:13:18 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Scaling-on-multi-core-Xeon-CPU/m-p/852106#M6634</guid>
      <dc:creator>AndrewC</dc:creator>
      <dc:date>2010-01-07T23:13:18Z</dc:date>
    </item>
    <item>
      <title>Re: Scaling on multi-core Xeon CPU</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Scaling-on-multi-core-Xeon-CPU/m-p/852107#M6635</link>
      <description>&lt;DIV style="margin: 0px; height: auto;"&gt;&lt;/DIV&gt;
In my experience, the useful part of Thread Profiler for OpenMP is what you get with the openmp-profile library, which comes with the compilers. I had another go this morning at the Windows VTune Thread Profiler with no luck.&lt;BR /&gt;TBB and MKL also come with the C++ compilers (same MKL also with Fortran).&lt;BR /&gt;Parallel Studio is a package including slightly simplified C++ with OpenMP, TBB, IPP, Amplifier (simplified VTune). If it met your needs, you probably wouldn't buy the others.&lt;BR /&gt;If you didn't need all the capabilities of VTune, or were not developing for current Intel CPUs, you would probably use gprof or oprofile/CodeAnalyst.&lt;BR /&gt;</description>
      <pubDate>Fri, 08 Jan 2010 00:00:31 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Scaling-on-multi-core-Xeon-CPU/m-p/852107#M6635</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2010-01-08T00:00:31Z</dc:date>
    </item>
  </channel>
</rss>

