MKL: P4 945 (2x 3.4 GHz) vs. Core 2 Duo E6600/E6700 (2x 2.4 GHz or 2.67 GHz)

andriyr · 11-10-2006

Hello,

We have to buy several high-performance workstations to run software based on the x64 version of MKL.

Could you please provide any performance comparison of the P4 945 vs. the Core 2 Duo E6600/E6700 for matrix operations using MKL?

Will it really be much faster if we buy the Core 2 Duo (everyone on the Internet praises it), or is it better to use processors with a really high clock frequency (P4 945)?

THANK YOU!

TimP · 11-10-2006

From MKL 8.1 on, MKL takes advantage of the Core 2 Duo's ability to execute up to 2 floating-point instructions in parallel per clock cycle, so MKL performance may be better than even on the 3.46 GHz Pentium D. Any of the threaded MKL functions will get a big boost from dual core, leaving the single-core P4 way behind.
Here are my results for DGEMM MKL 8.1 with ifort:
matrix dimensions       (20,25)x(25,25)   (50,25)x(25,25)   (101,25)x(25,25)

Pentium D  3.46 GHz          4226              4714               4958         Gflops
Core 2 Duo 2.933 GHz         2644              4956               5466

MKL DGEMM called by the gfortran MATMUL interface is nearly as good.
Note that these are not verified by anyone else and may not agree with benchmarks posted by the MKL team.
That Pentium D makes a lot more noise than the Core 2 Duo. The Pentium D 3.0 GHz was more popular, and was an excellent choice prior to the introduction of the Core 2 Duo.

You might have to restrict your usage to unthreaded cases to find an advantage in your P4.

TimP · 11-10-2006

Correction: without a decimal point, the figures in my previous post are Mflops (million floating-point operations per second), not Gflops.
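Since the original question was clock frequency versus architecture, the posted DGEMM figures can be put on a common footing. The sketch below is only arithmetic on the numbers quoted in this thread. It assumes the clock rates in the table's row labels and treats each figure as whole-chip throughput, because the post does not say how many threads were used. The function names are illustrative.

```python
# DGEMM rates exactly as posted in this thread, in Mflops.
SHAPES = ["(20,25)x(25,25)", "(50,25)x(25,25)", "(101,25)x(25,25)"]
MFLOPS = {
    "Pentium D": [4226, 4714, 4958],
    "Core 2 Duo": [2644, 4956, 5466],
}
# Clock rates in MHz, taken from the row labels of the posted table.
CLOCK_MHZ = {"Pentium D": 3460, "Core 2 Duo": 2933}


def flops_per_cycle(cpu):
    """Whole-chip flops per clock cycle: Mflops divided by MHz."""
    return [round(rate / CLOCK_MHZ[cpu], 2) for rate in MFLOPS[cpu]]


def core2_vs_pentium_d():
    """Ratio of Core 2 Duo to Pentium D throughput for each matrix shape."""
    return [
        round(c / p, 2)
        for c, p in zip(MFLOPS["Core 2 Duo"], MFLOPS["Pentium D"])
    ]


for shape, ratio in zip(SHAPES, core2_vs_pentium_d()):
    print(f"{shape}: Core 2 Duo / Pentium D = {ratio:.2f}")
for cpu in MFLOPS:
    print(cpu, "flops per cycle:", flops_per_cycle(cpu))
```

On these numbers the Core 2 Duo trails on the smallest case and leads on the two larger ones, despite its lower clock.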

andriyr · 11-11-2006

Tim, thank you very much! Your information is very useful to us.

You compared a very expensive Core 2 Duo (2x 2.933 GHz) with the P4 945. I think the Intel Pentium D 945 wins on price/performance, so we finally chose the Pentium D 945. We have tested it and are very satisfied with its performance.

Thank you again.

evgeny · 08-09-2007

Hi Tim,

I have a related question. My application runs on an Intel Xeon machine (3.0 GHz, 2 processors x 2 cores = 4 cores total) under Linux. No threading other than that in MKL is attempted.
I set 2 environment variables:

setenv OMP_NUM_THREADS 2
setenv KMP_AFFINITY 'verbose,scatter'
Running ZGEMM to compute A^T*B for two tall matrices (say 10,000,000 x 60),
I got about 5.8 GFlops with 1 thread and 7.8 GFlops with 2 threads.
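To put those rates in context, here is a rough sketch of the operation count. It assumes both matrices are 10,000,000 x 60 (so the product is 60 x 60) and uses the conventional count of 8*m*n*k real flops for a complex matrix multiply. The post does not say how its GFlops figures were computed, so the run times below are estimates under those assumptions, not measurements.

```python
def zgemm_real_flops(m, n, k):
    """Real flops for a complex (m x k) times (k x n) product.

    Uses the common convention of 8*m*n*k: each complex multiply-add
    is counted as 6 + 2 real operations.
    """
    return 8 * m * n * k


ROWS, COLS = 10_000_000, 60                   # "tall" matrix shape from the post
FLOPS = zgemm_real_flops(COLS, COLS, ROWS)    # A^T*B is COLS x COLS


def seconds_at(gflops):
    """Estimated time for one A^T*B at a sustained rate in GFlops."""
    return FLOPS / (gflops * 1e9)


print(f"real flops per product: {FLOPS:.3e}")
for rate in (5.8, 7.8):
    print(f"{rate} GFlops -> about {seconds_at(rate):.1f} s per product")
```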

Any suggestions on how to get better multicore performance?

Please see the details appended below.
Many thanks,

Evgeny

KMP_AFFINITY: Affinity capable, using global cpuid instr info
KMP_AFFINITY: Initial OS proc set respected:
{0,1,2,3}
KMP_AFFINITY: 4 available OS procs - Uniform topology of
KMP_AFFINITY: 2 packages x 2 cores/pkg x 1 threads/core (4 total cores)
KMP_AFFINITY: OS proc to physical thread map ([] => level not in map):
KMP_AFFINITY: OS proc 0 maps to package 0 core 0 [thread 0]
KMP_AFFINITY: OS proc 2 maps to package 0 core 1 [thread 0]
KMP_AFFINITY: OS proc 1 maps to package 3 core 0 [thread 0]
KMP_AFFINITY: OS proc 3 maps to package 3 core 1 [thread 0]
KMP_AFFINITY: Internal thread 0 bound to OS proc set {0}
KMP_AFFINITY: Internal thread 1 bound to OS proc set {1}

********* Content of /proc/cpuinfo *********************
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel Xeon CPU 5160 @ 3.00GHz
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
runqueue : 0
stepping : 6
cpu MHz : 2992.537
cache size : 4096 KB
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss
ht tm ferr syscall nx lm sse3 monitor ds-cpl gv3 tm2
bogomips : 5976.88
clflush size : 64
address sizes : 36 bits physical, 48 bits virtual

********* Linked with ***********************
mkl/9.1.021/lib/em64t/libmkl_em64t.a
and libguide.a

TimP · 08-10-2007

How was it when you used 4 threads?

evgeny · 08-13-2007

I ran with 4 threads and got approximately 9719 MFlops. The statistics are as follows:

1 thread: 5869 MFlops (1.0x)
2 threads: 7792 MFlops (1.33x)
4 threads: 9719 MFlops (1.66x)
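The speedups quoted above can be reproduced, and turned into parallel efficiency (speedup divided by thread count), with a few lines. This is only a sketch over the three measurements listed in this post; the function name is illustrative.

```python
# Measured ZGEMM rates from this post, in MFlops, keyed by thread count.
MFLOPS_BY_THREADS = {1: 5869, 2: 7792, 4: 9719}


def scaling(rates):
    """Return {threads: (speedup, efficiency)} relative to the 1-thread rate."""
    base = rates[1]
    return {
        threads: (round(rate / base, 2), round(rate / base / threads, 2))
        for threads, rate in rates.items()
    }


for threads, (speedup, efficiency) in scaling(MFLOPS_BY_THREADS).items():
    print(f"{threads} thread(s): speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

In these terms, efficiency falls to about 66% with 2 threads and about 41% with 4.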

So MKL's single-thread performance is really good for the problem in question, but with 2 and 4 threads the speedup is not impressive at all.

Here is what I did:
setenv OMP_NUM_THREADS 4
setenv KMP_AFFINITY 'verbose,scatter'

Do you know of a better way to run MKL ZGEMM multithreaded?

Thank you again,

Evgeny

TimP · 08-13-2007

I would think that MKL is cache-blocked, so there would be no advantage in using the scatter option to distribute the 2 threads across both L2 caches. I don't know whether scatter differs from compact in the 4-thread case; I would have thought compact would be preferred, in case there are cache lines shared between threads.
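The verbose KMP_AFFINITY output quoted earlier in the thread already shows where the scatter setting put the 2 threads. A minimal sketch of reading it mechanically follows. The regular expressions are written against the exact lines in that post and are an assumption about the log format in general; the binding pattern only handles single-processor sets such as {0}.

```python
import re

# Lines copied from the verbose output posted earlier in this thread.
LOG = """\
KMP_AFFINITY: OS proc 0 maps to package 0 core 0 [thread 0]
KMP_AFFINITY: OS proc 2 maps to package 0 core 1 [thread 0]
KMP_AFFINITY: OS proc 1 maps to package 3 core 0 [thread 0]
KMP_AFFINITY: OS proc 3 maps to package 3 core 1 [thread 0]
KMP_AFFINITY: Internal thread 0 bound to OS proc set {0}
KMP_AFFINITY: Internal thread 1 bound to OS proc set {1}
"""

PROC_RE = re.compile(r"OS proc (\d+) maps to package (\d+) core (\d+)")
BIND_RE = re.compile(r"Internal thread (\d+) bound to OS proc set \{(\d+)\}")


def thread_placement(log):
    """Map each OpenMP thread to the (package, core) it was bound to."""
    proc_to_core = {}
    placement = {}
    for line in log.splitlines():
        match = PROC_RE.search(line)
        if match:
            proc, package, core = map(int, match.groups())
            proc_to_core[proc] = (package, core)
            continue
        match = BIND_RE.search(line)
        if match:
            thread, proc = map(int, match.groups())
            placement[thread] = proc_to_core[proc]
    return placement


placement = thread_placement(LOG)
packages_used = {package for package, _core in placement.values()}
for thread, (package, core) in sorted(placement.items()):
    print(f"thread {thread}: package {package}, core {core}")
print("distinct packages in use:", len(packages_used))
```

For the posted log this reports the two threads on different packages, which is the "across both L2 caches" placement discussed above.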

evgeny · 08-13-2007

Let me make sure I understand the suggestion.
For 2 threads, should I use

setenv OMP_NUM_THREADS 2
setenv KMP_AFFINITY 'verbose,compact'

Are the settings above the right thing to do? Is there anything else one can try?

Thank you,

Evgeny

TimP · 08-13-2007

It might be interesting to see whether the compact option makes any difference with 2 or 4 threads.
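One way to organize that comparison is to generate every combination of thread count and affinity type up front and run the same benchmark under each. The sketch below only builds the environment settings, using the two variables already shown in this thread. The helper names are illustrative, and the benchmark command is left to the caller because the thread does not name the executable.

```python
import os
import subprocess


def sweep_settings(thread_counts=(2, 4), affinities=("compact", "scatter")):
    """Environment overrides for each thread-count / affinity combination."""
    return [
        {"OMP_NUM_THREADS": str(count), "KMP_AFFINITY": f"verbose,{affinity}"}
        for count in thread_counts
        for affinity in affinities
    ]


def run_case(command, settings):
    """Run a benchmark command (supplied by the caller) with the overrides applied."""
    env = dict(os.environ, **settings)
    return subprocess.run(command, env=env, capture_output=True, text=True, check=True)


# Print the csh equivalents of each case without running anything.
for settings in sweep_settings():
    for name, value in settings.items():
        print(f"setenv {name} '{value}'")
    print()
```

Each dictionary can be passed to run_case together with the actual benchmark command line, and the reported MFlops compared across the four cases.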