The X5687's base frequency is 3.6 GHz, while the E5-2670's is only 2.6 GHz. However, the E5-2670 has more cores, a bigger Smart Cache, and supports the AVX instruction set extension, versus SSE4.2 on the X5687. This means that if you compile code for the X5687 and run that same binary on the E5-2670 (no heavy memory access, SSE4.2 instructions only, only 8 threads working in parallel), the X5687 performs better because of its higher base frequency. You can measure CPU clocks in the critical functions, and the CPI (cycles per instruction) metric can be used for the comparison.
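To make the frequency argument concrete, here is a minimal sketch of the arithmetic, assuming the same binary reaches the same CPI on both processors and ignoring turbo. The instruction and cycle counts are hypothetical placeholders, not measured data.

```python
def cpi(cycles: int, instructions: int) -> float:
    """Cycles per instruction: lower means more work done per clock."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return cycles / instructions


def runtime_seconds(instructions: int, cpi_value: float, freq_hz: float) -> float:
    """Estimated time = instructions * CPI / clock frequency."""
    return instructions * cpi_value / freq_hz


# Hypothetical counts for one critical function (not measured data).
INSTRUCTIONS = 10_000_000_000
CYCLES = 12_000_000_000

function_cpi = cpi(CYCLES, INSTRUCTIONS)
# Assumption: identical CPI on both CPUs, base frequencies only.
t_x5687 = runtime_seconds(INSTRUCTIONS, function_cpi, 3.6e9)
t_e5_2670 = runtime_seconds(INSTRUCTIONS, function_cpi, 2.6e9)

print(f"CPI = {function_cpi:.2f}")
print(f"X5687   @ 3.6 GHz: {t_x5687:.2f} s")
print(f"E5-2670 @ 2.6 GHz: {t_e5_2670:.2f} s")
```

If the measured CPI differs between the two machines, the frequency ratio alone no longer explains the gap, and that is when the cache and parallelism metrics become worth a look.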
> Maybe I need to compile with ifort to target specific architectures? Compiling a binary for each processor family did not seem to make much of a difference, but maybe there are other flags I should try.
1. That is right! You have to build the code separately for each target system. For example, try running more parallel tasks on the E5-2670: if you use OpenMP*, you can call omp_set_num_threads(16).
2. Use AVX instead of SSE4.2. I recall that you can use advanced compiler switches such as "-xHost"; other optimization switches such as "-O2" or "-O3" can also be tried.
3. Measure the time spent in the critical functions with a VTune report; you can filter the results by function or module. The start and end times of a critical function can also be observed in the timeline pane of the bottom-up report.
4. You can use general-exploration analysis to compare VTune results from two sessions (different binaries for different processors), for example L1/L2/LLC cache misses.
5. Use Concurrency analysis to find out the parallelism level.
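One rough way to check whether the extra cores on the E5-2670 are actually being used is to time the same binary at several thread counts. The sketch below sets the standard OpenMP environment variable OMP_NUM_THREADS from a small Python harness. The command is a placeholder and the helper names are made up for illustration. Note that an explicit omp_set_num_threads() call inside the application takes precedence over the environment variable.

```python
import os
import subprocess
import sys
import time


def time_with_threads(cmd, num_threads):
    """Run cmd once with OMP_NUM_THREADS set; return wall time in seconds."""
    env = dict(os.environ, OMP_NUM_THREADS=str(num_threads))
    start = time.perf_counter()
    subprocess.run(cmd, env=env, check=True)
    return time.perf_counter() - start


def scaling_table(cmd, thread_counts):
    """Map each thread count to its measured wall time."""
    return {n: time_with_threads(cmd, n) for n in thread_counts}


if __name__ == "__main__":
    # Placeholder command; replace with the real application, e.g. ["./my_app"].
    command = [sys.executable, "-c", "pass"]
    for threads, seconds in scaling_table(command, [8, 16]).items():
        print(f"{threads:2d} threads: {seconds:.3f} s")
```

If the time barely changes between 8 and 16 threads, the code is probably not scaling, and the higher clock of the X5687 would be expected to win.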
You also need to take into account how much the same code will benefit from the newer generation of processors (Sandy Bridge Xeon in your case). For example, if the profiled code is not easily parallelized and scales well with CPU frequency, then I think the Westmere CPU can outperform the Sandy Bridge CPU.
Peter Wang (Intel) wrote:
> 4. You can use general-exploration analysis to compare vtune results from two sessions (which is different binary for different processor), L1/L2/LLC cache miss
Since the real problem I have is that two "identical" systems are giving me drastically different results, how can I use this "general-exploration analysis" to compare the two "identical" systems? I know there is a difference between different processor families, but with two identical processors I should be getting very similar results. I guess I am not sure what I should be looking at or for between two runs on two different hosts with the same processor.
The processors may be identical, but the runtimes are not, so there must be one or more reasons for the difference.
It would not make sense to do a *static* analysis of the binary (since the binary is identical), but the "general exploration analysis" of VTune is a dynamic runtime analysis that is likely to show ways in which the execution differs across the two systems.
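As a rough illustration of comparing two dynamic profiles, the sketch below ranks metrics by how much they differ between the two hosts, so the largest discrepancies surface first. The metric names and values are hypothetical, typed in by hand for the example. They are not VTune output or a VTune export format.

```python
def rank_metric_differences(run_a, run_b):
    """Return (metric, a, b, relative difference), largest difference first."""
    rows = []
    for name in sorted(set(run_a) & set(run_b)):
        a, b = run_a[name], run_b[name]
        base = max(abs(a), abs(b))
        rel = 0.0 if base == 0 else abs(a - b) / base
        rows.append((name, a, b, rel))
    return sorted(rows, key=lambda row: row[3], reverse=True)


# Hypothetical values copied from two profiling summaries (not real data).
fast_host = {"elapsed_s": 41.0, "cpi": 0.9, "llc_miss_count": 2.0e8}
slow_host = {"elapsed_s": 63.0, "cpi": 1.4, "llc_miss_count": 2.1e8}

for name, a, b, rel in rank_metric_differences(fast_host, slow_host):
    print(f"{name:16s} fast={a:<12g} slow={b:<12g} diff={rel:.0%}")
```

In this made-up case the CPI differs far more than the cache miss count, which would point away from memory behavior and toward something like clock speed or contention.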
Good luck! Sometimes seeing "how" the runs are different points fairly directly to "why", but sometimes it does not really help at all :-(
We have a Xeon v3 node in our lab that gives poor and variable performance compared to other nodes that seem identical. It took a while to realize that there is a problem with the cooling system on this particular box, so it is overheating very quickly and throttling the clocks. We will ship it back to the vendor, but in the meantime I was encouraged to build a bunch of test code so that we can monitor temperature and frequency on all of the boxes.
Using these tools we discovered that there is a fair amount of variability in the average clock speed of identical Xeon v3 processors when they are running in the power-limited regime. (So far this power limitation only happens when we run LINPACK using more than 1/2 of the cores, but we are on the lookout for other codes that might also run into the power limits.)
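For readers who want to try something similar on Linux, here is a minimal sketch of a temperature and frequency snapshot. It is not the test code mentioned above. It assumes the kernel exposes the usual sysfs files (thermal zones in millidegrees Celsius, cpufreq in kHz), which depends on the drivers loaded, and it returns empty lists where they are missing.

```python
import glob


def read_int(path):
    """Return the integer stored in a sysfs file, or None if unreadable."""
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        return None


def snapshot():
    """Collect temperatures (deg C) and core clocks (MHz) exposed by sysfs."""
    temp_paths = sorted(glob.glob("/sys/class/thermal/thermal_zone*/temp"))
    freq_paths = sorted(
        glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_cur_freq")
    )
    temps = [read_int(p) for p in temp_paths]
    freqs = [read_int(p) for p in freq_paths]
    return {
        "temp_c": [t / 1000.0 for t in temps if t is not None],
        "freq_mhz": [f / 1000.0 for f in freqs if f is not None],
    }


if __name__ == "__main__":
    print(snapshot())
```

Calling snapshot() periodically while the job runs, on each node, gives a simple record to compare across boxes.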
>>>I know there is a difference between different processor families but with two identical processors I should be getting very similar results.>>>
I presume that the two processors are running in different physical machines. There are many factors which can skew the results. Start by checking the CPU temperature under load, and verify that voltage fluctuation is within the normal range. From the software point of view, if the OS is Windows, I would recommend system-wide testing with a tool like Windows Performance Recorder (Xperf). Although VTune can provide similar information, WPR gathers a lot of so-called performance events from various providers (disk I/O, DPCs, ISRs, etc.).
>... like Windows Performance Recorder (Xperf) although VTune can provide similar information WPR is gathering a lot of so called performance events from various providers(Disk I/O, DPCs,ISRs etc....).
You can also use VTune: measure wait time and wait counts to see whether there are I/O waits, pending threads, etc.
Peter Wang (Intel) wrote:
>... like Windows Performance Recorder (Xperf) although VTune can provide similar information WPR is gathering a lot of so called performance events from various providers(Disk I/O, DPCs,ISRs etc....).
> Also can use VTune : measure wait time / wait counts, if there are some IO waits or threads' pending, etc.
Yes, of course. I usually start with system-wide profiling using WPR and later switch to VTune for in-depth application profiling.
I think the VTune summary tells you the processor frequency, but on Linux systems I always look at /proc/cpuinfo whenever running jobs. Some nodes will have fixed frequencies, but some will have variable ones (and for the latter you may need to warm up if you want to compare peak code performance).
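A small sketch of that check, assuming an x86 Linux system where /proc/cpuinfo contains "cpu MHz" lines (other architectures may not report them). The sample text is made up for illustration.

```python
import re


def parse_cpu_mhz(cpuinfo_text):
    """Extract the per-core 'cpu MHz' values from /proc/cpuinfo text."""
    pattern = r"^cpu MHz\s*:\s*([0-9.]+)"
    return [float(m) for m in re.findall(pattern, cpuinfo_text, re.MULTILINE)]


def read_cpu_mhz(path="/proc/cpuinfo"):
    """Return current per-core clocks in MHz, or [] if the file is unavailable."""
    try:
        with open(path) as handle:
            return parse_cpu_mhz(handle.read())
    except OSError:
        return []


# Made-up sample: core 0 is idling at a low clock, core 1 is busy.
sample = "processor\t: 0\ncpu MHz\t\t: 1200.000\nprocessor\t: 1\ncpu MHz\t\t: 2601.000\n"
print(parse_cpu_mhz(sample))
```

On a node with variable frequency, reading the values while the job is already running (after a warm-up) gives a more representative picture than reading them on an idle machine.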