Hello.
I have a problem.
The Intel Cluster FFT example (/opt/intel/Compiler/11.0/083/mkl/examples/cdftf) runs very slowly on my cluster, and when I increase the number of processes the execution time increases instead of decreasing. Execution time statistics for "STATUS = DftiComputeForwardDM(DESC,LOCAL)" on a 512*512 field (first column: MPI rank, second column: execution time in seconds):
DFTI_FORWARD_DOMAIN = DFTI_COMPLEX
DFTI_PRECISION = DFTI_DOUBLE
DFTI_DIMENSION = 2
DFTI_LENGTHS = (512,512)
DFTI_FORWARD_SCALE = 1.0
DFTI_BACKWARD_SCALE = 1.0/(M*N)
CREATE= 0
8 process:
0 0.2209660
7 0.2209670
1 0.2229670
6 0.2209670
3 0.2229670
4 0.2229670
2 0.2229670
5 0.2219670
16 process:
0 0.2129680
3 0.2129680
1 0.2129680
6 0.2129680
4 0.2129680
5 0.2129670
2 0.2129680
7 0.2129670
13 0.2389640
9 0.2389640
15 0.2389640
11 0.2389630
12 0.2389640
14 0.2389630
8 0.2389630
10 0.2389640
32 process:
0 0.5439169
5 0.5519149
1 0.5519161
7 0.5519171
3 0.5519159
4 0.5529160
28 0.3739430
13 0.5509160
18 0.2789580
6 0.5019231
2 0.5539160
9 0.5529160
12 0.5499170
8 0.5529151
15 0.5509162
11 0.5509150
14 0.5509160
10 0.5509150
20 0.2789570
16 0.2789580
21 0.2789570
17 0.2789590
22 0.2789570
19 0.2789580
23 0.2789580
24 0.3739420
27 0.3739440
31 0.3739430
25 0.3739430
29 0.3739440
30 0.3739430
26 0.3739430
64 process:
30 1.019846
49 0.3459470
45 0.3499470
5 1.026844
0 0.3339500
2 1.021845
6 1.031843
1 1.024845
4 1.027844
3 1.022845
7 1.024844
58 0.3379490
21 1.008847
13 1.020845
33 0.6359040
31 1.023844
27 1.030843
29 1.026844
25 1.027844
28 1.016845
24 1.027843
26 1.031843
52 0.3439469
48 0.3469470
53 0.3429482
51 0.3449471
55 0.3409491
54 0.3419471
50 0.3459470
32 1.012846
38 0.3569450
37 0.3579450
36 0.4479311
35 0.4559300
39 0.3559461
34 0.4559309
44 0.3499467
41 0.3529470
40 0.3539469
46 0.3489470
47 0.3479462
43 0.3509469
42 0.3519461
59 0.3379490
57 0.3369482
62 0.3349490
63 0.3339500
61 0.3359480
60 0.3379490
56 0.2829571
10 1.019846
9 1.027843
15 1.024845
14 1.018845
8 1.020845
11 1.020844
12 1.018845
17 1.013845
18 1.014845
22 1.010846
20 1.015846
19 1.010846
23 1.010846
16 1.016845
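The per-rank variation in the 64-process run is easy to quantify; the snippet below uses the fastest and slowest values copied from the table above.

```python
# Extreme per-rank times copied from the 64-process run above.
fastest = 0.2829571   # rank 56
slowest = 1.031843    # ranks 6 and 26
spread = slowest / fastest
print(f"64-process spread: {spread:.2f}x")   # roughly a 3.6x imbalance
```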
Configuration of one cluster node (/proc/cpuinfo output):
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel Xeon CPU 5140 @ 2.33GHz
stepping : 6
cpu MHz : 2333.423
cache size : 4096 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
bogomips : 4670.17
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel Xeon CPU 5140 @ 2.33GHz
stepping : 6
cpu MHz : 2333.423
cache size : 4096 KB
physical id : 3
siblings : 2
core id : 0
cpu cores : 2
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
bogomips : 4666.87
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
processor : 2
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel Xeon CPU 5140 @ 2.33GHz
stepping : 6
cpu MHz : 2333.423
cache size : 4096 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
bogomips : 4666.79
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel Xeon CPU 5140 @ 2.33GHz
stepping : 6
cpu MHz : 2333.423
cache size : 4096 KB
physical id : 3
siblings : 2
core id : 1
cpu cores : 2
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
bogomips : 4666.78
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
The cluster has 990 such nodes and starts one process per core.
I build the example with:
make libem64t mpi=mpich interface=ilp64
Please help me understand why it is so slow.
Svyatoslav
1 Solution
Svyatoslav,
First of all, looking at your data, one can conclude that your cluster seems to have some problems: note that for 64 processes the per-rank times differ by a factor of about 3!
Second, if the problem size is rather small and is fixed for every process count, the computation time will increase as you add processes: the amount of data sent between any two processes shrinks, so per-message latency comes to dominate the communication cost.
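This latency effect can be sketched with a toy latency/bandwidth model of the all-to-all exchange inside a distributed 2D FFT. All constants below (per-message cost, bandwidth, serial compute time) are made-up illustrative values, not measurements from this cluster; the point is only the shape of the curve.

```python
# Toy strong-scaling model for a distributed 2D FFT of fixed size N x N.
# Transposing the data requires each rank to exchange messages with the
# other p-1 ranks; each message carries N*N*16/p^2 bytes, so as p grows
# the per-message payload shrinks and the fixed per-message cost dominates.

N = 512                 # transform size, double complex (16 bytes/element)
DATA = N * N * 16       # total data in bytes (~4 MiB)
ALPHA = 5e-3            # assumed per-message cost on a congested network, s
BETA = 1.0 / 100e6      # assumed inverse bandwidth, s per byte (100 MB/s)
T_SERIAL = 0.5          # assumed single-process compute time, s

def model_time(p):
    """Modelled wall time for the transform on p processes."""
    compute = T_SERIAL / p
    comm = (p - 1) * (ALPHA + BETA * DATA / (p * p))
    return compute + comm

for p in (8, 16, 32, 64):
    print(f"{p:3d} processes: {model_time(p):.3f} s")
```

With these hypothetical constants the modelled time grows with the process count, just as in the measurements, because the (p-1)*ALPHA latency term keeps growing while the useful payload per message keeps shrinking.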
To utilize the full computing power of your cluster, you need to challenge it with a large enough transform. In general, the best performance (in terms of gigaflops) is achieved by transforms that use all the memory available on each node. However, keep in mind that, because additional buffers are allocated, the local part of the data being transformed should occupy only about 25% of the local memory.
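As a rough worked example of that sizing rule: the node memory below is an assumption (the cpuinfo listing above does not report RAM; dual Xeon 5140 nodes of that era commonly had 4 to 8 GiB), combined with the four-cores-per-node, one-rank-per-core layout described in the question.

```python
import math

# Sizing per the ~25% guideline. NODE_MEM is an assumption, not a
# measured figure; adjust it to the real memory per node.
NODE_MEM = 4 * 2**30        # assumed 4 GiB per node
RANKS_PER_NODE = 4          # 2 sockets x 2 cores, one MPI rank per core
ELEM = 16                   # bytes per double-complex element
NODES = 990                 # node count from the question

local_budget = NODE_MEM // 4 // RANKS_PER_NODE    # 25% of node RAM, per rank
cluster_data = local_budget * RANKS_PER_NODE * NODES
n_max = math.isqrt(cluster_data // ELEM)          # edge of a square 2D transform
example_data = 512 * 512 * ELEM                   # the posted example's dataset

print(f"per-rank budget: {local_budget / 2**20:.0f} MiB")       # 256 MiB
print(f"largest square 2D transform: ~{n_max} x {n_max}")
print(f"512 x 512 example: {example_data / 2**20:.0f} MiB total")  # 4 MiB
```

Under these assumptions the whole 512x512 example occupies only about 4 MiB across the entire cluster, orders of magnitude below what the machine can hold, which is why it cannot show meaningful scaling.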
Best regards,
Vladimir
