Profiler of TI's Code Composer Studio - some suggestions
I'd like to compare some aspects of the functionality of Intel's VTune 6.0 and the profiler in TI's Code Composer Studio 2.0. I hope this comparison will be helpful for the development and application of VTune.
My development work mostly focuses on speech signal processing. About 6 months ago I was developing a G.729 speech codec on TI's 62x DSP platform. Both the hardware and the development environment (Code Composer Studio - CCS 2.0) are powerful. CCS 2.0 also has a profiler. Its way of working differs from VTune's in some respects, at least from a user's point of view. The profiler is integrated with the IDE - CCS 2.0. A user can add any function or segment of code to the profiler. To me, the most important point is that the profiler seems to get a precise number of CPU cycles (for example, 665795 for one of the key functions in my algorithm) for everything being profiled. The profile result is displayed in a view inside the IDE, with the values changing as the program runs. I don't know how they implement this kind of profiling. Does this precise counting of CPU cycles need some special support from the CPU hardware?
VTune, on the other hand, uses EBS (event-based sampling) to count CPU events. But there is a Sample After Value, which cannot be too small. If I set it to 1, my hard disk will be full in no time. Also, VTune's sampling is a system-wide operation. This is necessary, because it lets me see how much time my program spends calling system APIs or MFC functions related to the user interface. But if VTune allowed the user to specify which part of his own program to profile and excluded everything else from profiling, a lot of time, disk space and memory would be saved. At the same time, profiling targeted at a small part of the program could be more refined and could generate more accurate results.
VTune has a very powerful utility - the Call Graph - for which TI has no counterpart as far as I know. It's very useful for locating the heaviest execution path and function in the program being profiled. But once I have located the heaviest function, the remaining work mostly involves tuning this function (or even a segment of code) again and again. At that point, generating the same profiling results (including call graph and sampling) for the parts that the user is currently not interested in (including the "light" parts of the user's program, OS modules, etc.) is a waste of time and resources.
Anyway, my technical experience and knowledge are very limited. I don't know whether the points I have raised are reasonable or objective. If not, please point out my errors and I will be very happy to receive your corrections. I'm sorry if my comments on VTune make you uncomfortable. But I do hope they can provide some useful ideas from the user's point of view. I do hope VTune has the most powerful functionality and is easy to use.
I think VTune can help you here too. Did you know that the profiled application itself can control VTune's data collection process?
VTune has a very interesting feature called "Pause/Resume". When VTune is in the resumed state, it collects data; when it is in the paused state, it simply waits for a "resume" command. By default, each activity is configured to start collection in the resumed state, but you can change this in the "Advanced Activity Configuration" dialog.
While an activity is running, you can see the "Pause/Resume" button on the toolbar. If VTune is currently paused, the button is labeled "Resume"; otherwise the same button is labeled "Pause". Pressing "Pause" pauses the current data collector, and pressing "Resume" resumes it.
But the most interesting feature is the ability to press this button from inside your application! VTune comes with 3 additional files: VTuneAPI.h, VTuneAPI.lib and VTuneAPI.dll. The first 2 files reside in the VTune installation tree, while the DLL is placed in your system directory. VTuneAPI.h is a C/C++ header file that declares 2 global functions: VTPause() and VTResume(). Calling VTPause() has exactly the same effect as manually pressing the "Pause" button on the VTune toolbar; calling VTResume() has the same effect as pressing "Resume".
Using VTuneAPI lets you help VTune collect only the relevant data. But do not forget that the VTuneAPI functions have a system-wide effect, so if you call VTPause() and VTResume() at the same time from different threads or processes, the result is unpredictable.
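To make the effect of pause/resume gating concrete, here is a small toy model in Python. It is only a sketch of the idea described above, not the VTune API: the class `ToySampler`, its methods and the cycle counts are all hypothetical, and `pause()`/`resume()` merely play the role that VTPause() and VTResume() play in a real C/C++ program.

```python
# Toy model of event-based sampling with pause/resume gating.
# Every name and number here is hypothetical; this is NOT the VTune API.

class ToySampler:
    def __init__(self, sample_after):
        self.sample_after = sample_after
        self.counter = 0      # events counted since the last sample
        self.paused = True    # this model starts in the paused state
        self.samples = []     # label of the code running at each sample

    def pause(self):          # plays the role of VTPause()
        self.paused = True

    def resume(self):         # plays the role of VTResume()
        self.paused = False

    def run(self, label, cycles):
        """Account for `cycles` clock ticks spent in code tagged `label`."""
        if self.paused:
            return            # nothing is counted while paused
        self.counter += cycles
        while self.counter >= self.sample_after:
            self.counter -= self.sample_after
            self.samples.append(label)


sampler = ToySampler(sample_after=5000)
sampler.run("startup", 40000)        # ignored: collector is paused
sampler.resume()
sampler.run("codec_kernel", 23000)   # region of interest
sampler.pause()
sampler.run("ui_code", 90000)        # ignored again

print(sampler.samples)
```

Only the code between `resume()` and `pause()` contributes samples, which is the behavior the two API calls are meant to give you around a region of interest.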
Thanks for your suggestion! You are really a master of VTune. :)
I've tried the pause and resume API functions in my program. They worked really well. I still have another question: can I exclude other running programs (such as IE) from VTune's sampling activity? Although I can turn VTune's sampling activity on and off from inside the program being profiled using the API functions, once the activity starts to run, it collects sampling data for all the processes running on the system. I don't think this is necessary.
Thanks! In fact, I love the feeling of contemplating, discussing and solving such technical problems. I like to share such questions with people who have the same interests. This forum is a very good place. I got a lot of help here, especially from you and kdmitry. You are both warmhearted helpers. Thank you!
Thank you for your kind words! Indeed, it's our pleasure to get your questions and your opinions on the VTune product!
Regarding your question about why the sampling collector collects information on all running processes:
Well, I don't think that it is a drawback. Actually, you pay nothing for it. The intrusiveness of the collection is very low. In addition, when viewing your results you can focus on exactly the process/module/function/source line/assembly line you wish.
"Sample after" introduces an element of probability into the results, but statistically this works out for large numbers. These "large numbers" should appear at the bottlenecks of your program. Tuning methodology suggests concentrating the optimization effort mainly on these places, the ones with a large number of samples.
In your description of Code Composer Studio - CCS 2.0, you say that you are able to get the precise CPU clock count per function. May I ask why this is so important to you? Why is the number of samples you get from VTune (which you can multiply by the "sample after" value to get a good approximation of the number of events that occurred) less valuable to you?
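The approximation mentioned here is simple arithmetic, sketched below in Python. The exact cycle count 665795 is the figure quoted earlier in the thread; the sample-after value of 5000 and the function name `estimate_events` are assumptions made for illustration, and the sample count is the idealized one a perfectly regular counter would produce.

```python
def estimate_events(samples, sample_after):
    """Approximate event count reconstructed from a number of samples."""
    return samples * sample_after


exact_cycles = 665795   # precise count quoted for one key function
sample_after = 5000     # assumed sample-after value

# An ideal counter fires once per `sample_after` events.
samples = exact_cycles // sample_after
estimate = estimate_events(samples, sample_after)
error = exact_cycles - estimate

print(samples, estimate, error)
print(f"relative error: {error / exact_cycles:.4%}")
```

In this idealized case the estimate is off by less than one sample-after interval, so the relative error shrinks as the function gets heavier.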
With regard to the issue of the number of CPU cycles, let me explain it this way:
If the sample after value is 5000, and a segment of my code takes fewer than 5000 CPU cycles to execute, it is possible that none of the lines in that segment generates a sample. If I still want to tune this part of the code thoroughly, I may not get enough information from the activity results. Such segments of code do exist. For example, in a speech signal processing algorithm, a piece of code may not need many cycles to execute by itself, but when it is put inside several layers of nested loops, the accumulated CPU time becomes considerable. At the same time, the time complexity of this small piece of code can vary between stages of the nested loops (depending on the data being processed). So it's necessary to get the sampling result for this part at a particular point in time, while it is being executed many times in the nested loops.
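The concern can be put in numbers with a short Python sketch. It assumes that samples land uniformly over the executed cycles, so the expected number of samples in a piece of code is its total cycles divided by the sample-after value. The per-call costs, call counts and stage names below are made up for illustration.

```python
def expected_samples(cycles_per_call, calls, sample_after):
    """Expected samples, assuming samples fall uniformly over executed cycles."""
    return cycles_per_call * calls / sample_after


SAMPLE_AFTER = 5000

# One execution of a 300-cycle segment: well under one expected sample.
single = expected_samples(300, 1, SAMPLE_AFTER)

# Hypothetical stages of the nested loops: (cycles per call, number of calls).
stages = {
    "stage_a": (120, 500),
    "stage_b": (300, 8000),
    "stage_c": (900, 50),
}
per_stage = {
    name: expected_samples(cycles, calls, SAMPLE_AFTER)
    for name, (cycles, calls) in stages.items()
}

print(single)
for name, value in per_stage.items():
    print(name, value)
```

With these invented numbers only the middle stage collects a few hundred samples; the other two end up with around ten each, which is too few to say much about individual source lines.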
On the other hand, if I set the sample after value to something rather small, 50 for example, the memory and disk consumption will be huge, because VTune collects information for every process running on the OS. That's why I feel that collecting data for all the processes all the time is unnecessary. Indeed, the sampling process has a very low level of intrusiveness. But such a full-scale operation is very heavy in memory and disk space consumption. That's why I think TI's way would add greatly to VTune's functionality. The present set of VTune functions is outstanding. But the ability to profile only selected parts of the code (including segments and functions) and to generate very fine-grained results would be a great plus. At least I feel this way.
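A rough back-of-the-envelope calculation in Python shows how quickly the data volume grows as the sample-after value shrinks. The 1 GHz clock and the 32 bytes per stored sample are pure assumptions (the real record size is not stated anywhere in this thread), and the model ignores the cost of taking each sample, which would likely dominate at very small values.

```python
def samples_per_second(clock_hz, sample_after):
    """Samples generated per second when sampling on clock ticks."""
    return clock_hz // sample_after


def bytes_per_second(clock_hz, sample_after, bytes_per_sample):
    """Raw data rate, ignoring any compression or buffering."""
    return samples_per_second(clock_hz, sample_after) * bytes_per_sample


CLOCK_HZ = 1_000_000_000   # assumed 1 GHz CPU
BYTES_PER_SAMPLE = 32      # assumed record size, for illustration only

rates = {
    sav: bytes_per_second(CLOCK_HZ, sav, BYTES_PER_SAMPLE)
    for sav in (5000, 50)
}

for sav, rate in rates.items():
    print(f"sample after {sav}: {rate / 1_000_000:.1f} MB/s")
```

Under these assumptions, going from 5000 to 50 multiplies the data rate by 100, whatever the actual record size is.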
Thanks for your interest in, and patience with, my questions and suggestions. I hope my explanation helps you understand my point of view. But I don't know whether I have expressed it clearly - in English. :)
Thank you very much for the clear explanation of how you work.
I believe you suggested the workaround yourself, right in your own text!
You say it "consumes a lot of CPU cycles when put inside loops as well on a particular data sets" (sorry for rephrasing).
How do you feel about putting your piece of the algorithm inside a loop executed a large number of times (even artificially) and feeding it special data sets that exercise this part of the algorithm heavily? Would you be able to get enough samples inside your code, even with a moderate "sample after" value?
How about raising the priority of the process/thread high enough to increase the time it spends on the CPU?
Maybe you could also shut down all unnecessary processes while you do your measurements?
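The first suggestion can be sized in advance. The Python sketch below estimates how many times a kernel would have to be repeated to collect a target number of samples, and wraps it in a trivial repeat harness. The function names, the 300-cycle cost, the target of 1000 samples and the toy kernel are all hypothetical; a real harness would call the actual codec routine on a data set chosen to stress it.

```python
def calls_needed(target_samples, sample_after, cycles_per_call):
    """Repetitions required so the kernel accumulates the target sample count."""
    total_cycles = target_samples * sample_after
    return -(-total_cycles // cycles_per_call)   # integer ceiling division


def run_repeated(kernel, data, calls):
    """Call `kernel(data)` `calls` times and return the last result."""
    result = None
    for _ in range(calls):
        result = kernel(data)
    return result


def toy_kernel(frame):
    """Stand-in for the real routine: energy of one frame."""
    return sum(x * x for x in frame)


calls = calls_needed(target_samples=1000, sample_after=5000, cycles_per_call=300)
heavy_frame = [3, -1, 4, -1, 5, -9, 2, -6]   # made-up input data
energy = run_repeated(toy_kernel, heavy_frame, calls)

print(calls, energy)
```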
First of all, I absolutely agree with you that in many cases limiting data generation to a specified user process has advantages over system-wide information gathering. But do not forget that Sampling does its work thanks to special hardware support in the processor. From the processor's point of view there is no such entity as a process. It is just executing instructions. The only way to support per-process sampling is to monitor context switches at the OS kernel level and change the processor's hardware settings depending on the currently active process. As far as I recall, some years ago we asked Microsoft for such kernel callback support, and they added it in Microsoft Windows 98 and Microsoft Windows NT 4.0 SP3, but then removed it again in Microsoft Windows 2000 and XP. They explained the removal by telling us that supporting such a callback required many additional clock cycles, which significantly increased their context-switch overhead and so slowed down the whole system.
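The difference between the two approaches can be illustrated with a toy Python model. It is only a sketch under simplifying assumptions: the timeline of time slices is invented, `sample_timeline` is a hypothetical name, and the `target` check stands in for a kernel context-switch callback that would enable the hardware counter only while the chosen process owns the CPU.

```python
from collections import Counter


def sample_timeline(timeline, sample_after, target=None):
    """Count samples per process over a list of (process, cycles) time slices.

    With target=None the counter runs system-wide. With a target, the
    counter is enabled only during slices that belong to that process.
    """
    counter = 0
    samples = Counter()
    for process, cycles in timeline:
        if target is not None and process != target:
            continue   # context-switch hook: counter disabled for this slice
        counter += cycles
        hits, counter = divmod(counter, sample_after)
        samples[process] += hits
    return samples


# Invented schedule: which process runs, and for how many cycles.
timeline = [
    ("codec", 12000),
    ("browser", 30000),
    ("codec", 8000),
    ("system", 10000),
]

system_wide = sample_timeline(timeline, 5000)
codec_only = sample_timeline(timeline, 5000, target="codec")

print(dict(system_wide))
print(dict(codec_only))
```

In this model the target process gets the same number of samples either way; what the per-process mode saves is the data recorded for everything else, at the price of extra work on every context switch.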
As Daniel has pointed out, these workarounds do help me get what I want, though it takes some time to find the right sample after value and a suitable data set. And kdmitry's explanation is of great value. I also understand Intel's position, though I don't have enough technical background to describe it clearly. TI, however, is different. It has everything under its control: the hardware, the simulator on the PC, the evaluation module board and the CCS IDE. But between VTune and Intel's CPUs - although these CPUs are made by Intel itself - there is an operating system called Windows. Some things become hard because of this, as kdmitry has pointed out.
Anyway, without question VTune is the best profiler on the x86 platform. I enjoy using it.