Performance evaluation of a parallel matrix-marix multiply

Farea__Afrah · ‎12-19-2017

I am running parallel programing codes on a server with these properties:

Processor type:                          Intel Xeon E5345
Number of calculation nodes:    47
Calculation number of kernel:  376
The amount of node memory is:   16 GB
Performance networking:         20 Gbps  InfiniBand DDR 
File System:                    44 TB    BGFS
Operating System:                        CentOS 5.4 x86_64

The architecture of this machine is like this as depicted by lstopo tool:

There are four CPU-cores, each core with these properties:

 Family:      6
  Model:      15
  Stepping:    6
  Type:        0
  Brand:       0
  CPU Model:           Core 2 Duo  [B2] Original OEM
  Feature flags:
                       fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
                       mca cmov pat pse36 clflsh ds acpi mmx fxsr sse sse2
                       ss ht tm pbe sse3 monitor ds-cpl vmx est tm2 ssse3 cx16 xTPR dca
  Extended
  feature flags:
                       SYSCALL em64t lahf_lm

  Cache info
  L1 Instruction cache: 32KB,  8-way associative. 64 byte line size.
  L1 Data cache:        32KB,  8-way associative. 64 byte line size.
  L3 unified cache:      4MB, 16-way associative. 64 byte line size.

  TLB info
  Instruction TLB:        4x  page entries,
                       or 8x 2MB pages entries, 4-way associative
  Instruction TLB:       4K  pages, 4-way associative, 128 entries.
  Data        TLB:       4MB pages, 4-way associative,  32 entries
  L0 Data     TLB:       4MB pages, 4-way set associative, 16 entries
  L0 Data     TLB:       4MB pages, 4-way set associative, 16 entries
  Data        TLB:       4K  pages, 4-way associative, 256 entries.
  64 byte prefetching.
  The physical package supports 2 logical processors 4MB

The peak performance of the processor is FP Peak/core: 9.332 GFlop/s( please correct me if it is not).

I could not figure out the latency and bandwidth of the server but the network is infiniband DDR. Th latency is 4.0 microsecond and the bandwidth is 1.3GB/s . I am not sure here : Is this the latency and bandwidth we should care about or other latency and bandwidth within the processor.

Another question please:

I wrote a code for parallel parallel matrix multiplication on this machine. I got this output:

The output looks strange but the code is true for doing multiplication. The matrix size is divided evenly among processors, genereted loclly and after multiplication; results are sent to processor zero for aggregation. The compiler is Intel mpiicpc.

How can you interpret these results.

Thank you

TimP · ‎12-20-2017

You have chosen a forum dedicated to totally different hardware. Perhaps you might find https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures or https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring among others, closer to the topics you wish to discuss.