Profiling tools... (slightly off-topic)

k_sarnath · 11-13-2009

This is a slightly off-topic discussion. I could not find a better place to post this.

This post has 2 parts to it.

I am also going to make some objective comparisons with CUDA. Please don't get touchy; I urge you to look at it in an objective way. Many thanks!

PART - 1
----------
I just started working on tuned Intel code. I gather there are performance counters for L1 cache misses and hits, core stalls, and pretty much everything about the instruction life cycle. But programming them requires OS-level privilege (wrmsr), so I need a driver to get even this basic profiling.

However, I find that all of Intel's tools are commercial... Hmm.. So sad! All I need is a small driver through which I could tune my assembly code. I don't need big fat GUIs telling me what to do. Basically, I don't require "Idiot series" software. I need software that will just get me the counter values. I can figure out the rest.

I urge the Intel software division to release at least a basic code-profiling tool for developers. Come on, I have bought your hardware... Don't I deserve at least a simple tool for free??

How many Intel developers really use advanced techniques like cache blocking, register blocking, DTLB blocking, keeping the functional units occupied, etc. to get performance from their code? I doubt it is many. The only thing Intel would say at this point is "Use our compilers; use our software; use Intel MKL". But yeah, times are changing. You may need to rethink your strategy.
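For readers who have not met the term, cache blocking can be sketched roughly as below. This is only an illustration in C, not code from anyone in this thread: the function names `matmul_naive` and `matmul_blocked` and the tile-size parameter `bs` are made up for the example, and a good value of `bs` depends on the cache sizes of the particular CPU. The inner `aik` variable hints at the register-blocking idea as well, since one operand is held in a register across the innermost loop.

```c
#include <assert.h>
#include <stddef.h>

/* Naive n x n matrix multiply (row-major): c = a * b.
 * The inner loop walks b column-wise, which is unfriendly to the cache. */
void matmul_naive(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

static size_t min_size(size_t x, size_t y) { return x < y ? x : y; }

/* Cache-blocked version: the work is split into bs x bs tiles so that the
 * tiles of a, b and c being touched can stay in cache while they are reused.
 * bs is a tuning parameter (an assumption of this sketch, not a fixed rule). */
void matmul_blocked(size_t n, size_t bs,
                    const double *a, const double *b, double *c)
{
    assert(bs > 0);

    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0;

    for (size_t ii = 0; ii < n; ii += bs) {
        for (size_t kk = 0; kk < n; kk += bs) {
            for (size_t jj = 0; jj < n; jj += bs) {
                size_t i_end = min_size(ii + bs, n);
                size_t k_end = min_size(kk + bs, n);
                size_t j_end = min_size(jj + bs, n);

                for (size_t i = ii; i < i_end; i++) {
                    for (size_t k = kk; k < k_end; k++) {
                        /* a[i][k] is loaded once and reused for the whole row of the tile. */
                        double aik = a[i * n + k];
                        for (size_t j = jj; j < j_end; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
                }
            }
        }
    }
}
```

Both functions compute the same product; the blocked one only changes the order in which the elements are visited. Whether it is actually faster on a given machine is exactly the kind of question the L1 miss counters would answer.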

There is a reason why people get highly excited by the speedups provided by technologies like CUDA: they don't know how good an Intel CPU can be. I think the main reason is that Intel did NOT spread awareness about its superscalar processors as much as it does about multi-core CPUs.

I think it is time Intel released:
1. A set of coding guidelines that demonstrate nice techniques to developers
2. A basic profiling tool, FREE of cost, so developers can understand Intel cores better.

Note that CUDA programming model automatically takes care of "Register and Cache blocking" -- which is a major reason for effortless performance.

PART-2
---------
I attended an Intel conference yesterday in Bangalore. I understand that the micro-architecture keeps changing all the time (every few years) -- meaning my optimized hand-written assembly code is NOT guaranteed to stay optimal on newer versions... That's a great cause for concern. Even MKL stores one version of the code for each micro-architecture...

That's just a pain.
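To make that maintenance burden concrete: keeping one kernel per micro-architecture usually means choosing among them at run time, for example through a table of function pointers. The sketch below is a guess at the general pattern in C, not a description of how MKL is actually implemented. Everything in it is hypothetical: the `uarch_t` tags, the `saxpy_*` kernels, and `select_saxpy`. In real code the tag would come from CPUID rather than being passed in by the caller, and the "tuned" kernel here is just an unrolled loop standing in for hand-written assembly.

```c
#include <assert.h>
#include <stddef.h>

/* Signature shared by every variant of the kernel: y += alpha * x. */
typedef void (*saxpy_fn)(size_t n, float alpha, const float *x, float *y);

/* Hypothetical micro-architecture tags; real code would derive one from CPUID. */
typedef enum { UARCH_GENERIC, UARCH_CORE2, UARCH_NEHALEM, UARCH_COUNT } uarch_t;

/* Portable fallback that works everywhere. */
void saxpy_generic(size_t n, float alpha, const float *x, float *y)
{
    assert(n == 0 || (x != NULL && y != NULL));
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

/* Stand-in for a kernel tuned for one micro-architecture (here: unrolled by 4). */
void saxpy_unroll4(size_t n, float alpha, const float *x, float *y)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        y[i]     += alpha * x[i];
        y[i + 1] += alpha * x[i + 1];
        y[i + 2] += alpha * x[i + 2];
        y[i + 3] += alpha * x[i + 3];
    }
    for (; i < n; i++)          /* leftover elements */
        y[i] += alpha * x[i];
}

/* One entry per tag; every new micro-architecture adds a row (and a kernel). */
static const saxpy_fn saxpy_table[UARCH_COUNT] = {
    [UARCH_GENERIC] = saxpy_generic,
    [UARCH_CORE2]   = saxpy_generic,
    [UARCH_NEHALEM] = saxpy_unroll4,
};

/* Pick the kernel for a tag, falling back to the generic one for unknown tags. */
saxpy_fn select_saxpy(uarch_t u)
{
    if ((unsigned)u >= (unsigned)UARCH_COUNT)
        return saxpy_generic;
    return saxpy_table[u];
}
```

A caller would do something like `select_saxpy(tag)(n, alpha, x, y)`. The table has to grow, and each new kernel has to be written and tuned, every time a new micro-architecture appears.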

Let us look at CUDA in an objective way. CUDA code is compiled to an intermediate binary (PTX, a virtual architecture) which is close to the native architecture. At run time, this code is translated for whichever architecture it is running on. So I don't have to worry about anything. Intel should probably look at this to help developers.

Thanks,
Best Regards,
Sarnath

k_sarnath · 11-13-2009

Check out this thread (below) in the NV forums... One person is talking about how fast CUDA is... Another person is saying how he optimized his CPU code to run 6000x faster :-(

http://forums.nvidia.com/index.php?showtopic=150697

TimP · 11-13-2009

Quoting - k_sarnath

I figure that there are performance counters to look at L1 cache misses, hits, core stalls, pretty much everything about the instruction life cycle. But they all require OS-level privilege (wrmsr) to do it. I need a driver to get this basic profiling.

However, I find that all of Intel tools are commercial...
How many Intel developers really use advanced techniques like cache blocking, register blocking, DTLB blocking, keeping functional units occupied, etc. to get performance from the code?

Note that CUDA programming model automatically takes care of "Register and Cache blocking" -- which is a major reason for effortless performance.
________________________________________
oprofile is the usual open-source tool for collecting performance counters. I don't see that you'd benefit from further dispersion of effort. The single license for VTune, PTU, and SEP doesn't come close to defraying the investment in that group of tools.

The primary requirement for these tools is to support the needs of independent software developers. They make the decisions on how much effort can be put into low level optimizations.
There's been a lot of effort put into CUDA programming. Not "effortless" by any stretch of the imagination.
You must have noticed the announcements about Ct from Intel and RapidMind. It does a similar thing in a slightly higher-level model by adding a namespace to C++ with run-time support including JIT translation and TBB. The level of run-time optimization is controlled by an environment variable.
Besides that, it seems there is continued development of parallel programming constructs, some looking like a combination of C99 and Fortran 2008.
I don't see how you can criticize Intel for promoting vector parallel programming models when you praise CUDA for carrying the same thing further.
There's no doubt that building run-time optimization into the platform will get continued attention. Perhaps you are fooled by the absence of marketing language in the public references to the incorporation of the former Transmeta and Elbrus technologies and principal personnel into Intel development.

k_sarnath · 11-15-2009

Quoting - tim18

The primary requirement for these tools is to support the needs of independent software developers. They make the decisions on how much effort can be put into low level optimizations.
There's been a lot of effort put into CUDA programming. Not "effortless" by any stretch of the imagination.
You must have noticed the announcements about Ct from Intel and RapidMind. It does a similar thing in a slightly higher-level model by adding a namespace to C++ with run-time support including JIT translation and TBB. The level of run-time optimization is controlled by an environment variable.
Besides that, it seems there is continued development of parallel programming constructs, some looking like a combination of C99 and Fortran 2008.
I don't see how you can criticize Intel for promoting vector parallel programming models when you praise CUDA for carrying the same thing further.
There's no doubt that building run-time optimization into the platform will get continued attention. Perhaps you are fooled by the absence of marketing language in the public references to the incorporation of the former Transmeta and Elbrus technologies and principal personnel into Intel development.

Oprofile works only on Linux. I am not sure whether it would return the performance counters I am interested in.

I don't criticize Intel for promoting multi-core programming. I never said that. All I am saying is that Intel should first help developers write better code by

1. Releasing a very basic library-cum-driver for reading the various performance counters on all platforms (Windows, Linux, etc.) for "FREE"

2. Providing run-time translation to relieve developers of rewriting code for every micro-architecture change.

By 'effortless' what I meant was: on Intel I have to put in extra effort to do "cache blocking" and "register blocking". In CUDA, these are natural consequences of the programming model, so they are built in, which obviously gives good performance. That was my point. I was not talking about the complexity of CUDA programming.

Finally, I am only giving feedback to Intel as a developer. I am concerned that people might jump to CUDA too easily without knowing how good Intel is. That's my point. Thanks for answering...

TimP · 11-15-2009

Quoting - k_sarnath

Oprofile works only on Linux. I am not sure whether it would return the performance counters I am interested in.

Oprofile on Windows usually is run under the CodeAnalyst disguise. oprofile is said to be GPL, you must have rights to some source materials so you could add the events you have in mind, unless somehow the legalities of writing a portable Windows driver are sufficient to prevent it. If there were enough interest in a more public oprofile for Windows, no doubt someone would do it.

k_sarnath · 11-15-2009

Quoting - tim18

Oprofile on Windows usually is run under the CodeAnalyst disguise. oprofile is said to be GPL, you must have rights to some source materials so you could add the events you have in mind, unless somehow the legalities of writing a portable Windows driver are sufficient to prevent it. If there were enough interest in a more public oprofile for Windows, no doubt someone would do it.

Tim, Thanks a LOT for answering.... I was not aware of CodeAnalyst before.

I just searched... CodeAnalyst is a FREE tool from AMD. That's exactly what I am asking Intel to do.

Wikipedia says it works only with AMD CPUs. The site http://www.virtualdub.org/blog/pivot/entry.php?id=288 says that the TBS (Time Based Sampling) feature of AMD CodeAnalyst works fine with Intel CPUs, but EBS (Event Based Sampling) does NOT.

I don't think I would use a tool from AMD on an Intel CPU. You never know. Even among Intel CPUs, some performance counters are MODEL SPECIFIC and NOT architectural, so there is a lot of variation within Intel itself. I would not risk running an AMD tool on my Intel CPU...

Well, AMD has read the market... What more can I say? Good to see that, though.

I wish Intel did something like that and developed "awareness" among budding developers, especially in universities.

That's my 2 cents of feedback to Intel. Hope someone is listening.

TimP · 11-16-2009

Now you have a proof point showing that Wikipedia isn't necessarily correct.
Once again, oprofile is under the GNU General Public License. Read up on what that means.
Last time I read the EULAs, you were in violation if you ran CodeAnalyst on ACML. No such problem with MKL.
You could ask about plans for academic licensing on Parallel Studio forum.

k_sarnath · 11-16-2009

Quoting - tim18

Now you have a proof point showing that Wikipedia isn't necessarily correct.
Once again, oprofile is under the GNU General Public License. Read up on what that means.
Last time I read the EULAs, you were in violation if you ran CodeAnalyst on ACML. No such problem with MKL.
You could ask about plans for academic licensing on Parallel Studio forum.

I will check out Oprofile. Thanks for your inputs!