This post has two parts.
I am also going to make some objective comparisons with CUDA. Please don't get touchy. I urge you to look at it in an objective way. Many thanks!
PART - 1
I just started working on tuned Intel code. I gather that there are performance counters for L1 cache misses and hits, core stalls, and pretty much everything else about the instruction life cycle. But programming them requires OS-level privilege (wrmsr), so I need a driver to get even this basic profiling.
However, I find that all of Intel's tools are commercial... Hmm.. So sad! All I need is a small driver through which I can tune my assembly code. I don't need big fat GUIs telling me what to do. Basically, I don't require "Idiot series" software. I need software that will just get me the counter values. I can figure out the rest.
I urge Intel's software division to release at least a basic tool for profiling code for developers.. Come on, I have bought your hardware... Don't I deserve at least a simple tool for free??
How many Intel developers really use advanced techniques like cache blocking, register blocking, DTLB blocking, keeping the functional units occupied, etc. to get performance from their code? I doubt it would be many. The only thing Intel would say at this point is "Use our compilers; use our software; use Intel MKL". But yeah, times are changing. You may need to re-think your strategy.
There is a reason why people get highly excited by the speedups provided by technologies like CUDA. It is because they don't know how good an Intel CPU can be. I think the main reason is that Intel did NOT spread awareness about their superscalar processors as much as they have about multi-core CPUs.
I think it is time Intel released:
1. A set of coding guidelines that demonstrate these techniques to developers
2. A basic profiling tool, FREE of cost, to help developers understand Intel cores better.
Note that the CUDA programming model automatically takes care of "register and cache blocking" -- which is a major reason for its effortless performance.
I attended an Intel conference yesterday in Bangalore. I understand that the micro-architecture changes every few years, meaning my hand-optimized assembly code is NOT guaranteed to stay optimal on newer versions... That's a great cause for concern. Even MKL stores one version of its code for each micro-architecture...
That's just a pain.
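To make the "one version per micro-architecture" point concrete, here is a minimal sketch of run-time dispatch in C. It is only an illustration, not how MKL actually does it: the names sum_generic, sum_tuned and select_sum are hypothetical, the "tuned" kernel is just an unrolled loop standing in for hand-written code, and the check for "sse4.2" is an arbitrary example feature. It assumes GCC or Clang on x86 for the __builtin_cpu_supports builtin and falls back to the generic kernel everywhere else.

```c
#include <stddef.h>

/* Common signature shared by every kernel variant (hypothetical). */
typedef double (*sum_fn)(const double *x, size_t n);

/* Baseline kernel: portable, runs on any CPU. */
double sum_generic(const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Stand-in for a kernel tuned for one micro-architecture. Here it is
   only unrolled four ways with independent accumulators; a real one
   would be hand-written for that specific core. */
double sum_tuned(const double *x, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++)          /* leftover elements */
        s0 += x[i];
    return (s0 + s1) + (s2 + s3);
}

/* Pick a kernel once, at run time, based on what the CPU reports. */
sum_fn select_sum(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse4.2"))   /* assumed example feature */
        return sum_tuned;
#endif
    return sum_generic;
}
```

The caller does something like `sum_fn f = select_sum();` once and then calls `f(data, n)`. Every new micro-architecture means another kernel and another branch in the selector, which is exactly the maintenance burden I am complaining about.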
Let us look at CUDA in an objective way. CUDA code is compiled to an intermediate binary (PTX, a virtual architecture) which is close to the native architecture. At run time, this code is translated for whatever architecture is actually present. So I don't have to worry about anything. Intel should probably look at this to help developers.
There's been a lot of effort put into CUDA programming. Not "effortless" by any stretch of the imagination.
You must have noticed the announcements about Ct from Intel and RapidMind. It does a similar thing in a slightly higher-level model by adding a namespace to C++ with run-time support including JIT translation and TBB. The level of run-time optimization is controlled by an environment variable.
Besides that, it seems there is continued development of parallel programming constructs, some looking like a combination of C99 and Fortran 2008.
I don't see how you can criticize Intel for promoting vector parallel programming models when you praise CUDA for carrying the same thing further.
There's no doubt that building run-time optimization into the platform will get continued attention. Perhaps you are fooled by the absence of marketing language in the public references to the incorporation of the former Transmeta and Elbrus technologies and principal personnel into Intel development.
OProfile works only on Linux. I am not sure if it would return the performance counters I am interested in.
I don't criticize Intel for promoting multi-core programming. I never said that. All I am saying is that Intel should first help developers write better code by providing:
1. A very basic library-cum-driver for reading the various performance counters on all platforms (Windows, Linux, etc.) for "FREE"
2. Run-time translation, to relieve developers of rewriting code for every micro-architecture change.
By 'effortless' what I meant was: on Intel I have to put in extra effort to do "cache blocking" and "register blocking". In CUDA, these are natural consequences of the programming model. They are built in, which obviously gives good performance.. That was my point. I was not talking about the complexity of CUDA programming....
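For anyone not familiar with the terms, here is a minimal sketch in C of the kind of extra effort I mean, using matrix multiplication as the example. The function names and the TILE value of 32 are my own assumptions for illustration; a real tile size has to be tuned to the cache sizes of the specific core, which is where the profiling counters come in.

```c
#include <stddef.h>

/* Hypothetical tile edge; in practice tuned to the target's caches. */
#define TILE 32

static inline size_t min_size(size_t x, size_t y) { return x < y ? x : y; }

/* Straightforward version: c = a * b for n x n row-major matrices.
   The inner loop walks a column of b, so for large n it keeps
   missing in cache. */
void matmul_naive(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

/* Cache-blocked version: work on TILE x TILE sub-matrices so the data
   touched by the inner loops is reused while it is still in cache. */
void matmul_blocked(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0;

    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t kk = 0; kk < n; kk += TILE)
            for (size_t jj = 0; jj < n; jj += TILE) {
                size_t i_end = min_size(ii + TILE, n);
                size_t k_end = min_size(kk + TILE, n);
                size_t j_end = min_size(jj + TILE, n);
                for (size_t i = ii; i < i_end; i++)
                    for (size_t k = kk; k < k_end; k++) {
                        /* Loaded once and reused across the whole j
                           loop: a small step toward register blocking. */
                        double aik = a[i * n + k];
                        for (size_t j = jj; j < j_end; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
            }
}
```

Both functions compute the same product; the second just reorders the loops into tiles. On the CPU I have to write and tune this restructuring myself, and that is the effort I was referring to.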
Finally, I am only giving feedback to Intel as a developer..... I am concerned that people might jump to CUDA too easily without knowing how good Intel is... That's my point. Thanks for answering...
Tim, thanks a LOT for answering.... I was not aware of CodeAnalyst before.
I just searched... CodeAnalyst is a FREE tool from AMD.... That's exactly what I am telling Intel to do.
Wikipedia says it works only with AMD CPUs. The site http://www.virtualdub.org/blog/pivot/entry.php?id=288 says that the TBS feature of AMD CodeAnalyst works fine with Intel CPUs... but the EBS (Event Based Sampling) does NOT.
I don't think I would use a tool from AMD on an Intel CPU. You never know.. Even within Intel CPUs, there are some performance counters that are MODEL SPECIFIC and NOT architectural... So there is a lot of variation within Intel itself... I would not risk running an AMD tool on my Intel CPU...
Well, AMD has read the market... What more can I say... Good to see that though....
I wish Intel did something like that and developed "awareness" among budding developers... especially in universities..
That's my 2 cents of feedback to Intel. Hope someone is listening..
Once again, OProfile is under the GNU General Public License. Read up on what that means.
Last time I read the EULAs, you were in violation if you ran CodeAnalyst on ACML. No such problem with MKL.
You could ask about plans for academic licensing on the Parallel Studio forum.