I am running parallel programing codes on a server with these properties:
Processor type: Intel Xeon E5345 Number of calculation nodes: 47 Calculation number of kernel: 376 The amount of node memory is: 16 GB Performance networking: 20 Gbps InfiniBand DDR File System: 44 TB BGFS Operating System: CentOS 5.4 x86_64
The architecture of this machine is like this as depicted by lstopo tool:
Family: 6 Model: 15 Stepping: 6 Type: 0 Brand: 0 CPU Model: Core 2 Duo [B2] Original OEM Feature flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflsh ds acpi mmx fxsr sse sse2 ss ht tm pbe sse3 monitor ds-cpl vmx est tm2 ssse3 cx16 xTPR dca Extended feature flags: SYSCALL em64t lahf_lm Cache info L1 Instruction cache: 32KB, 8-way associative. 64 byte line size. L1 Data cache: 32KB, 8-way associative. 64 byte line size. L3 unified cache: 4MB, 16-way associative. 64 byte line size. TLB info Instruction TLB: 4x page entries, or 8x 2MB pages entries, 4-way associative Instruction TLB: 4K pages, 4-way associative, 128 entries. Data TLB: 4MB pages, 4-way associative, 32 entries L0 Data TLB: 4MB pages, 4-way set associative, 16 entries L0 Data TLB: 4MB pages, 4-way set associative, 16 entries Data TLB: 4K pages, 4-way associative, 256 entries. 64 byte prefetching. The physical package supports 2 logical processors 4MB
The peak performance of the processor is FP Peak/core: 9.332 GFlop/s( please correct me if it is not).
I could not figure out the latency and bandwidth of the server but the network is infiniband DDR. Th latency is 4.0 microsecond and the bandwidth is 1.3GB/s . I am not sure here : Is this the latency and bandwidth we should care about or other latency and bandwidth within the processor.
Another question please:
I wrote a code for parallel parallel matrix multiplication on this machine. I got this output:
The output looks strange but the code is true for doing multiplication. The matrix size is divided evenly among processors, genereted loclly and after multiplication; results are sent to processor zero for aggregation. The compiler is Intel mpiicpc.
How can you interpret these results.
You have chosen a forum dedicated to totally different hardware. Perhaps you might find https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures or https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring among others, closer to the topics you wish to discuss.