The X5687's base frequency is 3.6 GHz, while the E5-2670's is only 2.6 GHz. However, the E5-2670 has more cores, a larger Smart Cache, and supports the AVX instruction set extension, versus SSE4.2 on the X5687. This means that if you compile code for the X5687, the binary still runs on the E5-2670, but without using its advantages: no large memory accesses that would benefit from the bigger cache, SSE4.2 instructions instead of AVX, and only 8 threads working in parallel. In that case the X5687 performs better because of its higher base frequency. You can measure CPU clockticks on the critical functions and use the CPI metric to compare.
> Maybe I need to compile with ifort to target specific architectures? Compiling a binary for each processor family did not seem to make much of a difference, but maybe there are other flags I should try.
1. That is right! You have to build the code for each target system. For example, to run more parallel tasks on the E5-2670, you can call omp_set_num_threads(16) if you use OpenMP*.
2. Use AVX instead of SSE4.2. I recall that you can use advanced compiler switches such as "-xHost". Other switches such as "-O2" or "-O3" may also be worth trying.
3. Measure the time spent in critical functions using the VTune report; you can filter results by function or module. The start and end times of a critical function can also be observed in the timeline panel of the bottom-up report.
4. Use general-exploration analysis to compare VTune results from two sessions (a different binary for each processor), for example L1/L2/LLC cache misses.
5. Use Concurrency analysis to find the parallelism level.
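To make the CPI and frequency argument above concrete, here is a minimal Python sketch. It assumes you have already read a clocktick count and a retired-instruction count for a critical function out of a profiler report; the function names and the numbers in the comments are illustrative and do not come from this thread.

```python
def cpi(clockticks: int, instructions_retired: int) -> float:
    """Cycles per instruction for a function or a whole run."""
    if instructions_retired <= 0:
        raise ValueError("instructions_retired must be positive")
    return clockticks / instructions_retired


def elapsed_seconds(clockticks: int, frequency_hz: float) -> float:
    """Time needed to execute `clockticks` cycles at a fixed core frequency."""
    if frequency_hz <= 0:
        raise ValueError("frequency_hz must be positive")
    return clockticks / frequency_hz


# Illustrative numbers only: the same single-threaded work, with the same CPI,
# needs the same number of cycles on both CPUs, so the run time scales with
# the clock frequency.
cycles = 7_200_000_000
time_at_3_6_ghz = elapsed_seconds(cycles, 3.6e9)
time_at_2_6_ghz = elapsed_seconds(cycles, 2.6e9)
```

If the CPI of a critical function is similar on both machines, the gap is probably explained by frequency or thread count; if the CPI differs a lot, cache misses or instruction set differences are more likely candidates.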
You also need to take into account how much the same code will benefit from the newer processor generation (Sandy Bridge Xeon, in your case). For example, if the profiled code is not easily parallelized and scales well with CPU frequency, then I think the Westmere CPU can outperform the Sandy Bridge CPU.
Peter Wang (Intel) wrote:
> 4. You can use general-exploration analysis to compare vtune results from two sessions (which is different binary for different processor), L1/L2/LLC cache miss
Since the real problem I have is that two "identical" systems are giving me drastically different results, how can I use this "general-exploration analysis" to compare the two "identical" systems? I know there is a difference between different processor families, but with two identical processors I should be getting very similar results. I guess I am not sure what I should be looking at, or for, between two runs on two different hosts with the same processor.
The processors may be identical, but the runtimes are not, so there must be one or more reasons for the difference.
It would not make sense to do a *static* analysis of the binary (since the binary is identical), but the "general exploration analysis" of VTune is a dynamic runtime analysis that is likely to show ways in which the execution differs across the two systems.
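One way to act on this is to collect the same set of metrics on both hosts and rank them by how much they differ. The sketch below is only an illustration in Python: it assumes you have copied the metrics of interest from each run into a plain dictionary by hand, and the metric names are hypothetical, not VTune's own field names.

```python
def rank_metric_differences(run_a: dict, run_b: dict) -> list:
    """Return (name, a, b, relative_difference) tuples, largest difference first.

    Only metrics present in both runs are compared. The relative difference
    is |a - b| divided by the larger magnitude, so it lies between 0 and 1
    for non-negative metrics.
    """
    rows = []
    for name in sorted(run_a.keys() & run_b.keys()):
        a, b = run_a[name], run_b[name]
        baseline = max(abs(a), abs(b))
        rel = 0.0 if baseline == 0 else abs(a - b) / baseline
        rows.append((name, a, b, rel))
    rows.sort(key=lambda row: row[3], reverse=True)
    return rows


# Hypothetical values typed in from two reports (fast host vs. slow host).
fast_host = {"cpi": 1.0, "llc_misses": 100.0, "elapsed_s": 50.0}
slow_host = {"cpi": 1.1, "llc_misses": 400.0, "elapsed_s": 80.0}

for name, a, b, rel in rank_metric_differences(fast_host, slow_host):
    print(f"{name:12s} fast={a:10.1f} slow={b:10.1f} rel_diff={rel:.2f}")
```

The metrics at the top of the list are the first places to look; a large gap in something like cache misses or average frequency on nominally identical hardware is a hint, not a diagnosis.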
Good luck! Sometimes seeing "how" the runs are different points fairly directly to "why", but sometimes it does not really help at all :-(
We have a Xeon v3 node in our lab that gives poor and variable performance compared to other nodes that seem identical. It took a while to realize that there is a problem with the cooling system on this particular box, so it is overheating very quickly and throttling the clocks. We will ship it back to the vendor, but in the meantime I was encouraged to build a bunch of test code so that we can monitor temperature and frequency on all of the boxes.
Using these tools we discovered that there is a fair amount of variability in the average clock speed of identical Xeon v3 processors when they are running in the power-limited regime. (So far this power limitation only happens when we run LINPACK using more than 1/2 of the cores, but we are on the lookout for other codes that might also run into the power limits.)
> I know there is a difference between different processor families but with two identical processors I should be getting very similar results.
I presume the two processors are in different physical machines. Many factors can skew the results. Start by checking the CPU temperature under load and verifying that voltage fluctuation is within the normal range. On the software side, if the OS is Windows, I would recommend system-wide profiling with a tool like Windows Performance Recorder (Xperf). Although VTune can provide similar information, WPR gathers many so-called performance events from various providers (disk I/O, DPCs, ISRs, etc.).
>... like Windows Performance Recorder (Xperf) although VTune can provide similar information WPR is gathering a lot of so called performance events from various providers(Disk I/O, DPCs,ISRs etc....).
You can also use VTune to measure wait time and wait counts, to see whether there are I/O waits, pending threads, etc.
Peter Wang (Intel) wrote:
> >... like Windows Performance Recorder (Xperf) although VTune can provide similar information WPR is gathering a lot of so called performance events from various providers(Disk I/O, DPCs,ISRs etc....).
> Also can use VTune : measure wait time / wait counts, if there are some IO waits or threads' pending, etc.
Yes, of course. I usually start with system-wide profiling using WPR and later switch to VTune for in-depth application profiling.
I think the VTune summary reports the processor frequency, but on Linux systems I always look at /proc/cpuinfo whenever I run jobs. Some nodes have a fixed frequency and some have variable frequencies (for the latter you may need a warm-up run if you want to compare peak code performance).
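A minimal Python sketch of that check follows, assuming an x86 Linux node where /proc/cpuinfo lists one "cpu MHz" line per logical CPU (other architectures may not expose that field). The helper names are made up for this example.

```python
def parse_cpu_mhz(cpuinfo_text: str) -> list:
    """Extract the per-CPU 'cpu MHz' values from /proc/cpuinfo-style text."""
    freqs = []
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "cpu MHz":
            freqs.append(float(value))
    return freqs


def read_cpu_mhz(path: str = "/proc/cpuinfo") -> list:
    """Read the current per-CPU frequencies (in MHz) on a Linux host."""
    with open(path) as f:
        return parse_cpu_mhz(f.read())


def summarize(freqs: list) -> str:
    """One-line min/max/mean summary, or a note if no values were found."""
    if not freqs:
        return "no 'cpu MHz' entries found"
    mean = sum(freqs) / len(freqs)
    return f"min={min(freqs):.0f} max={max(freqs):.0f} mean={mean:.0f} MHz"
```

Calling `summarize(read_cpu_mhz())` a few times while the job is running, on each node, shows whether the cores are actually holding the frequency you expect; a wide min/max spread or a mean that sags over time suggests frequency scaling or throttling rather than a code problem.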
