- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I am running parallel programing codes on a server with these properties:
Processor type: Intel Xeon E5345
Number of calculation nodes: 47
Calculation number of kernel: 376
The amount of node memory is: 16 GB
Performance networking: 20 Gbps InfiniBand DDR
File System: 44 TB BGFS
Operating System: CentOS 5.4 x86_64
The architecture of this machine is like this as depicted by lstopo tool:
There are four CPU-cores, each core with these properties:
Family: 6
Model: 15
Stepping: 6
Type: 0
Brand: 0
CPU Model: Core 2 Duo [B2] Original OEM
Feature flags:
fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflsh ds acpi mmx fxsr sse sse2
ss ht tm pbe sse3 monitor ds-cpl vmx est tm2 ssse3 cx16 xTPR dca
Extended
feature flags:
SYSCALL em64t lahf_lm
Cache info
L1 Instruction cache: 32KB, 8-way associative. 64 byte line size.
L1 Data cache: 32KB, 8-way associative. 64 byte line size.
L3 unified cache: 4MB, 16-way associative. 64 byte line size.
TLB info
Instruction TLB: 4x page entries,
or 8x 2MB pages entries, 4-way associative
Instruction TLB: 4K pages, 4-way associative, 128 entries.
Data TLB: 4MB pages, 4-way associative, 32 entries
L0 Data TLB: 4MB pages, 4-way set associative, 16 entries
L0 Data TLB: 4MB pages, 4-way set associative, 16 entries
Data TLB: 4K pages, 4-way associative, 256 entries.
64 byte prefetching.
The physical package supports 2 logical processors 4MB
The peak performance of the processor is FP Peak/core: 9.332 GFlop/s( please correct me if it is not).
I could not figure out the latency and bandwidth of the server but the network is infiniband DDR. Th latency is 4.0 microsecond and the bandwidth is 1.3GB/s . I am not sure here : Is this the latency and bandwidth we should care about or other latency and bandwidth within the processor.
Another question please:
I wrote a code for parallel parallel matrix multiplication on this machine. I got this output:
The output looks strange but the code is true for doing multiplication. The matrix size is divided evenly among processors, genereted loclly and after multiplication; results are sent to processor zero for aggregation. The compiler is Intel mpiicpc.
How can you interpret these results.
Thank you
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
You have chosen a forum dedicated to totally different hardware. Perhaps you might find https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures or https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring among others, closer to the topics you wish to discuss.
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page