<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Cluster 2D FFT very Slow, Why? in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Cluster-2D-FFT-very-Slow-Why/m-p/864255#M7736</link>
    <description>&lt;DIV style="margin:0px;"&gt;Svyatoslav,&lt;BR /&gt;&lt;BR /&gt;First of all, your data suggest that your cluster has some problems: note that for 64 processes the times differ by a factor of 3!&lt;BR /&gt;&lt;BR /&gt;Second, if the problem size is rather small and stays fixed for every number of processes, the computation time will increase as you increase the number of nodes. With more processes, each message sent from one process to another gets smaller, so communication latency accounts for a growing share of the run time.&lt;BR /&gt;&lt;BR /&gt;To utilize the full computing power of your cluster, you need to challenge it with a big enough transform size. In general, the best performance (in terms of gigaflops) is achieved for transforms that utilize all the memory available on each node. However, please keep in mind that, because additional buffers are allocated, the local part of the data being transformed should occupy only about 25% of the local memory.&lt;BR /&gt;&lt;BR /&gt;Best regards,&lt;BR /&gt;Vladimir&lt;BR /&gt;&lt;/DIV&gt;</description>
    <pubDate>Mon, 06 Jul 2009 14:17:19 GMT</pubDate>
    <dc:creator>Vladimir_Petrov__Int</dc:creator>
    <dc:date>2009-07-06T14:17:19Z</dc:date>
    <item>
      <title>Cluster 2D FFT very Slow, Why?</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Cluster-2D-FFT-very-Slow-Why/m-p/864254#M7735</link>
      <description>Hello.&lt;BR /&gt;&lt;BR /&gt;I have a problem.&lt;BR /&gt;&lt;BR /&gt;The Intel Cluster FFT example (/opt/intel/Compiler/11.0/083/mkl/examples/cdftf) runs very slowly on my cluster, and when I increase the number of processes, the execution time increases. Execution times for "STATUS = DftiComputeForwardDM(DESC,LOCAL)" on a 512*512 field (first column: MPI rank; second column: execution time in seconds):&lt;BR /&gt;&lt;BR /&gt;DFTI_FORWARD_DOMAIN = DFTI_COMPLEX&lt;BR /&gt;DFTI_PRECISION = DFTI_DOUBLE&lt;BR /&gt;DFTI_DIMENSION = 2&lt;BR /&gt;DFTI_LENGTHS = (512,512)&lt;BR /&gt;DFTI_FORWARD_SCALE = 1.0&lt;BR /&gt;DFTI_BACKWARD_SCALE = 1.0/(M*N)&lt;BR /&gt;&lt;BR /&gt;CREATE= 0&lt;BR /&gt;&lt;BR /&gt;8 processes:&lt;BR /&gt;&lt;BR /&gt; 0  0.2209660    &lt;BR /&gt; 7  0.2209670    &lt;BR /&gt; 1  0.2229670    &lt;BR /&gt; 6  0.2209670    &lt;BR /&gt; 3  0.2229670    &lt;BR /&gt; 4  0.2229670    &lt;BR /&gt; 2  0.2229670    &lt;BR /&gt; 5  0.2219670&lt;BR /&gt;&lt;BR /&gt;16 processes:&lt;BR /&gt;&lt;BR /&gt; 0  0.2129680    &lt;BR /&gt; 3  0.2129680    &lt;BR /&gt; 1  0.2129680    &lt;BR /&gt; 6  0.2129680    &lt;BR /&gt; 4  0.2129680    &lt;BR /&gt; 5  0.2129670    &lt;BR /&gt; 2  0.2129680    &lt;BR /&gt; 7  0.2129670    &lt;BR /&gt; 13  0.2389640    &lt;BR /&gt; 9  0.2389640    &lt;BR /&gt; 15  0.2389640    &lt;BR /&gt; 11  0.2389630    &lt;BR /&gt; 12  0.2389640    &lt;BR /&gt; 14  0.2389630    &lt;BR /&gt; 8  0.2389630    &lt;BR /&gt; 10  0.2389640&lt;BR /&gt;&lt;BR /&gt;32 processes:&lt;BR /&gt;&lt;BR /&gt;0 0.5439169 &lt;BR /&gt; 5 0.5519149 &lt;BR /&gt; 1 0.5519161 &lt;BR /&gt; 7 0.5519171 &lt;BR /&gt; 3 0.5519159 &lt;BR /&gt; 4 0.5529160 &lt;BR /&gt; 28 0.3739430 &lt;BR /&gt; 13 0.5509160 &lt;BR /&gt; 18 0.2789580 &lt;BR /&gt; 6 0.5019231 &lt;BR /&gt; 2 0.5539160 &lt;BR /&gt; 9 0.5529160 &lt;BR /&gt; 12 0.5499170 &lt;BR /&gt; 8 0.5529151 &lt;BR /&gt; 15 0.5509162 &lt;BR /&gt; 11 0.5509150 &lt;BR /&gt; 14 0.5509160 &lt;BR /&gt; 10 0.5509150 &lt;BR /&gt; 20 0.2789570 
&lt;BR /&gt; 16 0.2789580 &lt;BR /&gt; 21 0.2789570 &lt;BR /&gt; 17 0.2789590 &lt;BR /&gt; 22 0.2789570 &lt;BR /&gt; 19 0.2789580 &lt;BR /&gt; 23 0.2789580 &lt;BR /&gt; 24 0.3739420 &lt;BR /&gt; 27 0.3739440 &lt;BR /&gt; 31 0.3739430 &lt;BR /&gt; 25 0.3739430 &lt;BR /&gt; 29 0.3739440 &lt;BR /&gt; 30 0.3739430 &lt;BR /&gt; 26 0.3739430&lt;BR /&gt;&lt;BR /&gt;64 processes:&lt;BR /&gt;&lt;BR /&gt; 30   1.019846    &lt;BR /&gt; 49  0.3459470    &lt;BR /&gt; 45  0.3499470    &lt;BR /&gt; 5   1.026844    &lt;BR /&gt; 0  0.3339500    &lt;BR /&gt; 2   1.021845    &lt;BR /&gt; 6   1.031843    &lt;BR /&gt; 1   1.024845    &lt;BR /&gt; 4   1.027844    &lt;BR /&gt; 3   1.022845    &lt;BR /&gt; 7   1.024844    &lt;BR /&gt; 58  0.3379490    &lt;BR /&gt; 21   1.008847    &lt;BR /&gt; 13   1.020845    &lt;BR /&gt; 33  0.6359040    &lt;BR /&gt; 31   1.023844    &lt;BR /&gt; 27   1.030843    &lt;BR /&gt; 29   1.026844    &lt;BR /&gt; 25   1.027844    &lt;BR /&gt; 28   1.016845    &lt;BR /&gt; 24   1.027843    &lt;BR /&gt; 26   1.031843    &lt;BR /&gt; 52  0.3439469    &lt;BR /&gt; 48  0.3469470    &lt;BR /&gt; 53  0.3429482    &lt;BR /&gt; 51  0.3449471    &lt;BR /&gt; 55  0.3409491    &lt;BR /&gt; 54  0.3419471    &lt;BR /&gt; 50  0.3459470    &lt;BR /&gt; 32   1.012846    &lt;BR /&gt; 38  0.3569450    &lt;BR /&gt; 37  0.3579450    &lt;BR /&gt; 36  0.4479311    &lt;BR /&gt; 35  0.4559300    &lt;BR /&gt; 39  0.3559461    &lt;BR /&gt; 34  0.4559309    &lt;BR /&gt; 44  0.3499467    &lt;BR /&gt; 41  0.3529470    &lt;BR /&gt; 40  0.3539469    &lt;BR /&gt; 46  0.3489470    &lt;BR /&gt; 47  0.3479462    &lt;BR /&gt; 43  0.3509469    &lt;BR /&gt; 42  0.3519461    &lt;BR /&gt; 59  0.3379490    &lt;BR /&gt; 57  0.3369482    &lt;BR /&gt; 62  0.3349490    &lt;BR /&gt; 63  0.3339500    &lt;BR /&gt; 61  0.3359480    &lt;BR /&gt; 60  0.3379490    &lt;BR /&gt; 56  0.2829571    &lt;BR /&gt; 10   1.019846    &lt;BR /&gt; 9   1.027843    &lt;BR /&gt; 15   1.024845    &lt;BR /&gt; 14   1.018845    
&lt;BR /&gt; 8   1.020845    &lt;BR /&gt; 11   1.020844    &lt;BR /&gt; 12   1.018845    &lt;BR /&gt; 17   1.013845    &lt;BR /&gt; 18   1.014845    &lt;BR /&gt; 22   1.010846    &lt;BR /&gt; 20   1.015846    &lt;BR /&gt; 19   1.010846    &lt;BR /&gt; 23   1.010846    &lt;BR /&gt; 16   1.016845&lt;BR /&gt;&lt;BR /&gt;Configuration of one cluster module:&lt;BR /&gt;&lt;BR /&gt;processor : 0&lt;BR /&gt;vendor_id : GenuineIntel&lt;BR /&gt;cpu family : 6&lt;BR /&gt;model  : 15&lt;BR /&gt;model name : Intel Xeon CPU 5140 @ 2.33GHz&lt;BR /&gt;stepping : 6&lt;BR /&gt;cpu MHz  : 2333.423&lt;BR /&gt;cache size : 4096 KB&lt;BR /&gt;physical id : 0&lt;BR /&gt;siblings : 2&lt;BR /&gt;core id  : 0&lt;BR /&gt;cpu cores : 2&lt;BR /&gt;fpu  : yes&lt;BR /&gt;fpu_exception : yes&lt;BR /&gt;cpuid level : 10&lt;BR /&gt;wp  : yes&lt;BR /&gt;flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm&lt;BR /&gt;bogomips : 4670.17&lt;BR /&gt;clflush size : 64&lt;BR /&gt;cache_alignment : 64&lt;BR /&gt;address sizes : 36 bits physical, 48 bits virtual&lt;BR /&gt;power management:&lt;BR /&gt;&lt;BR /&gt;processor : 1&lt;BR /&gt;vendor_id : GenuineIntel&lt;BR /&gt;cpu family : 6&lt;BR /&gt;model  : 15&lt;BR /&gt;model name : Intel Xeon CPU 5140 @ 2.33GHz&lt;BR /&gt;stepping : 6&lt;BR /&gt;cpu MHz  : 2333.423&lt;BR /&gt;cache size : 4096 KB&lt;BR /&gt;physical id : 3&lt;BR /&gt;siblings : 2&lt;BR /&gt;core id  : 0&lt;BR /&gt;cpu cores : 2&lt;BR /&gt;fpu  : yes&lt;BR /&gt;fpu_exception : yes&lt;BR /&gt;cpuid level : 10&lt;BR /&gt;wp  : yes&lt;BR /&gt;flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm&lt;BR /&gt;bogomips : 4666.87&lt;BR /&gt;clflush size : 64&lt;BR /&gt;cache_alignment : 64&lt;BR /&gt;address sizes : 36 
bits physical, 48 bits virtual&lt;BR /&gt;power management:&lt;BR /&gt;&lt;BR /&gt;processor : 2&lt;BR /&gt;vendor_id : GenuineIntel&lt;BR /&gt;cpu family : 6&lt;BR /&gt;model  : 15&lt;BR /&gt;model name : Intel Xeon CPU 5140 @ 2.33GHz&lt;BR /&gt;stepping : 6&lt;BR /&gt;cpu MHz  : 2333.423&lt;BR /&gt;cache size : 4096 KB&lt;BR /&gt;physical id : 0&lt;BR /&gt;siblings : 2&lt;BR /&gt;core id  : 1&lt;BR /&gt;cpu cores : 2&lt;BR /&gt;fpu  : yes&lt;BR /&gt;fpu_exception : yes&lt;BR /&gt;cpuid level : 10&lt;BR /&gt;wp  : yes&lt;BR /&gt;flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm&lt;BR /&gt;bogomips : 4666.79&lt;BR /&gt;clflush size : 64&lt;BR /&gt;cache_alignment : 64&lt;BR /&gt;address sizes : 36 bits physical, 48 bits virtual&lt;BR /&gt;power management:&lt;BR /&gt;&lt;BR /&gt;processor : 3&lt;BR /&gt;vendor_id : GenuineIntel&lt;BR /&gt;cpu family : 6&lt;BR /&gt;model  : 15&lt;BR /&gt;model name : Intel Xeon CPU 5140 @ 2.33GHz&lt;BR /&gt;stepping : 6&lt;BR /&gt;cpu MHz  : 2333.423&lt;BR /&gt;cache size : 4096 KB&lt;BR /&gt;physical id : 3&lt;BR /&gt;siblings : 2&lt;BR /&gt;core id  : 1&lt;BR /&gt;cpu cores : 2&lt;BR /&gt;fpu  : yes&lt;BR /&gt;fpu_exception : yes&lt;BR /&gt;cpuid level : 10&lt;BR /&gt;wp  : yes&lt;BR /&gt;flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm&lt;BR /&gt;bogomips : 4666.78&lt;BR /&gt;clflush size : 64&lt;BR /&gt;cache_alignment : 64&lt;BR /&gt;address sizes : 36 bits physical, 48 bits virtual&lt;BR /&gt;power management:&lt;BR /&gt;&lt;BR /&gt;The cluster has 990 such modules.&lt;BR /&gt;&lt;BR /&gt;The cluster runs one process per core.&lt;BR /&gt;&lt;BR /&gt;I build the example with:&lt;BR /&gt;&lt;BR /&gt;make libem64t mpi=mpich interface=ilp64&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;
Please help me understand why it is so slow.&lt;BR /&gt;&lt;BR /&gt;Svyatoslav&lt;BR /&gt;&lt;BR /&gt;</description>
      <pubDate>Mon, 06 Jul 2009 10:24:03 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Cluster-2D-FFT-very-Slow-Why/m-p/864254#M7735</guid>
      <dc:creator>svyatoslav_korneev</dc:creator>
      <dc:date>2009-07-06T10:24:03Z</dc:date>
    </item>
    <item>
      <title>Re: Cluster 2D FFT very Slow, Why?</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Cluster-2D-FFT-very-Slow-Why/m-p/864255#M7736</link>
      <description>&lt;DIV style="margin:0px;"&gt;Svyatoslav,&lt;BR /&gt;&lt;BR /&gt;First of all, your data suggest that your cluster has some problems: note that for 64 processes the times differ by a factor of 3!&lt;BR /&gt;&lt;BR /&gt;Second, if the problem size is rather small and stays fixed for every number of processes, the computation time will increase as you increase the number of nodes. With more processes, each message sent from one process to another gets smaller, so communication latency accounts for a growing share of the run time.&lt;BR /&gt;&lt;BR /&gt;To utilize the full computing power of your cluster, you need to challenge it with a big enough transform size. In general, the best performance (in terms of gigaflops) is achieved for transforms that utilize all the memory available on each node. However, please keep in mind that, because additional buffers are allocated, the local part of the data being transformed should occupy only about 25% of the local memory.&lt;BR /&gt;&lt;BR /&gt;Best regards,&lt;BR /&gt;Vladimir&lt;BR /&gt;&lt;/DIV&gt;</description>
      <pubDate>Mon, 06 Jul 2009 14:17:19 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Cluster-2D-FFT-very-Slow-Why/m-p/864255#M7736</guid>
      <dc:creator>Vladimir_Petrov__Int</dc:creator>
      <dc:date>2009-07-06T14:17:19Z</dc:date>
    </item>
  </channel>
</rss>

