Getting elapsed time per function in a parallel application

Albert_F_ · 10-06-2014

Hi,

I'm only able to get CPU time, but I need elapsed (wall-clock) time for each function. The application is parallelized using OpenMP.

Ideally this could be obtained through the command-line interface.

regards,

TimP · 10-06-2014

A step in that direction is to filter by thread and look for the views showing the maximum time in each function. I'd hate to attempt that from the command line.

Dmitry_P_Intel1 · 10-07-2014

Let me understand the need better. Are the functions you want to measure called inside an OpenMP parallel region by the worker threads, or are they more like global application phases?

Albert_F_ · 10-07-2014

Second scenario: global application phases.

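#include <stddef.h>     /* NULL */
#include <sys/time.h>   /* gettimeofday, struct timeval */

void phase( void )
{
	struct timeval start, end;

	/* gettimeofday() takes a second (timezone) argument; pass NULL */
	gettimeofday( &start, NULL );

	pre_process();

	#pragma omp parallel
	{
		/* the main work goes here */
	}

	post_process();

	gettimeofday( &end, NULL );
}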

So I can get the aggregated time the threads spend in the parallel region (CPU time), but what I need is end - start.
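For reference, "end - start" from two gettimeofday() samples can be computed as in the sketch below. The helper names elapsed_seconds, time_phase and demo_phase are hypothetical and not part of the original code; demo_phase only stands in for the real pre_process / parallel region / post_process sequence.

```c
#include <stddef.h>
#include <sys/time.h>

/* Difference between two gettimeofday() samples, in seconds. */
double elapsed_seconds(const struct timeval *start, const struct timeval *end)
{
	return (double)(end->tv_sec - start->tv_sec)
	     + (double)(end->tv_usec - start->tv_usec) / 1e6;
}

/* Hypothetical helper: run one phase and return its wall-clock duration. */
double time_phase(void (*phase_body)(void))
{
	struct timeval start, end;

	gettimeofday(&start, NULL);
	phase_body();
	gettimeofday(&end, NULL);

	return elapsed_seconds(&start, &end);
}

/* Placeholder workload standing in for the real phase. */
void demo_phase(void)
{
	volatile long sink = 0;
	for (long i = 0; i < 1000000; i++)
		sink += i;
}
```

Calling time_phase(demo_phase) returns the wall-clock seconds for one run. Note that gettimeofday() follows the system clock, so a clock adjustment during the phase would distort the result.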

Dmitry_P_Intel1 · 10-07-2014

There is a Frame ITT API (https://software.intel.com/en-us/node/496605) that VTune supports. It lets you generate frames and then, using the frame-based groupings in the grid, explore elapsed time and other frame information. A frame is essentially a global time region with a begin, an end, and a name.
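A minimal sketch of wrapping the phase in a frame, assuming the ittnotify.h header that ships with VTune. The USE_ITT build flag, the PHASE_FRAME_* macros, the frames_submitted counter and the empty pre_process / post_process bodies are hypothetical, added only so the sketch also compiles without the ITT headers.

```c
#ifdef USE_ITT                       /* assumed build flag, not defined by VTune */
#include <ittnotify.h>
static __itt_domain *phase_domain;
/* The domain name is what identifies the frames in the results. */
#define PHASE_FRAME_INIT()  (phase_domain = __itt_domain_create("phase"))
#define PHASE_FRAME_BEGIN() __itt_frame_begin_v3(phase_domain, NULL)
#define PHASE_FRAME_END()   __itt_frame_end_v3(phase_domain, NULL)
#else
/* No-op stand-ins so the code builds without the ITT headers. */
#define PHASE_FRAME_INIT()  ((void)0)
#define PHASE_FRAME_BEGIN() ((void)0)
#define PHASE_FRAME_END()   ((void)0)
#endif

int frames_submitted = 0;            /* hypothetical counter, for illustration */

void pre_process(void)  { /* placeholder */ }
void post_process(void) { /* placeholder */ }

void phase(void)
{
	PHASE_FRAME_BEGIN();             /* frame starts: global time region begins */

	pre_process();

	#pragma omp parallel
	{
		/* the main work goes here */
	}

	post_process();

	PHASE_FRAME_END();               /* frame ends: elapsed time = end - begin */
	frames_submitted++;
}
```

Call PHASE_FRAME_INIT() once at startup, then each call to phase() produces one frame. Building with USE_ITT also requires linking against the ittnotify library.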

Also note that, starting with VTune Amplifier XE 2015 and Intel Compiler 14 or later, you can get automated annotations for OpenMP regions and explore OpenMP efficiency, as described here: https://software.intel.com/en-us/node/529272

Regards, Dmitry

McCalpinJohn · 10-07-2014

I think I started asking for built-in instrumentation of OpenMP parallel regions before the standard was even officially launched in 1997. :-(

Instead of waiting, I decided to just get in the habit of building my own. Inside that parallel region, every thread begins by reading a timer (typically RDTSC on recent processors with the constant_tsc attribute, but gettimeofday() is fine on most systems) and saving it in a "start time" array (indexed by thread number). When each thread is finished with its work (but before it enters any implicit barriers), it reads the timer again and saves it in an "end time" array. Then I can look at the variation in start times, the variation in elapsed times, the variation in end times, etc.

Mildly labor-intensive, but valuable -- and once you have done it once, it is pretty easy to get in the habit of including such instrumentation any time you write OpenMP code.
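The approach described above might look roughly like the following sketch. It uses gettimeofday() rather than RDTSC, and the names MAX_THREADS, now_seconds, timed_parallel_region, region_span and demo_work are hypothetical. Without OpenMP enabled at compile time it falls back to a single thread.

```c
#include <stddef.h>
#include <sys/time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

#define MAX_THREADS 256              /* assumed upper bound for this sketch */

double start_time[MAX_THREADS];      /* indexed by thread number */
double end_time[MAX_THREADS];

static double now_seconds(void)
{
	struct timeval tv;
	gettimeofday(&tv, NULL);
	return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

/* Run work(tid) on every thread, recording per-thread start/end times.
   Returns the number of threads whose timestamps were recorded. */
int timed_parallel_region(void (*work)(int))
{
	int nthreads = 1;

	#pragma omp parallel
	{
		int tid = 0;
#ifdef _OPENMP
		tid = omp_get_thread_num();
		if (tid == 0)
			nthreads = omp_get_num_threads();
#endif
		if (tid < MAX_THREADS)
			start_time[tid] = now_seconds();

		work(tid);

		/* read the timer before the implicit barrier at the region's end */
		if (tid < MAX_THREADS)
			end_time[tid] = now_seconds();
	}
	return nthreads < MAX_THREADS ? nthreads : MAX_THREADS;
}

/* Earliest start to latest end across the first n threads. */
double region_span(int n)
{
	double first = start_time[0], last = end_time[0];
	for (int i = 1; i < n; i++) {
		if (start_time[i] < first) first = start_time[i];
		if (end_time[i] > last)    last = end_time[i];
	}
	return last - first;
}

/* Placeholder workload. */
void demo_work(int tid)
{
	volatile long sink = 0;
	for (long i = 0; i < 100000L * (tid + 1); i++)
		sink += i;
}
```

After a call such as n = timed_parallel_region(demo_work), comparing end_time[i] - start_time[i] across threads shows the load imbalance, and region_span(n) approximates the wall-clock time of the region.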

Peter_W_Intel · 10-09-2014

Generally speaking, you can get performance data thread by thread. For example:

#amplxe-cl -collect concurrency -- ./program

#amplxe-cl -report hotspots -group-by thread

This displays the performance data for each thread.

If you want to know the CPU time for a specific OpenMP* region, call VTune's resume API before the region and its pause API after it. The elapsed time then runs from the first thread's creation to the last thread's termination.

Peter_W_Intel · 10-09-2014

If you want to know the aggregated CPU time for a specific function (pthread or winthread) that many threads use as their entry function, insert the resume API call before the first thread's creation and the pause API call after the last thread's termination. The elapsed time is then what you want: the function's lifetime across the threads. (Note: you may start VTune in start-paused mode.)
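A minimal sketch of the resume/pause placement, assuming the __itt_resume() and __itt_pause() calls from ittnotify.h. It uses an OpenMP loop instead of explicit pthreads; the USE_ITT flag, the COLLECT_* macros and region_of_interest are hypothetical names for illustration.

```c
#ifdef USE_ITT                       /* assumed build flag, not defined by VTune */
#include <ittnotify.h>
#define COLLECT_RESUME() __itt_resume()
#define COLLECT_PAUSE()  __itt_pause()
#else
/* No-op stand-ins so the code builds without the ITT headers. */
#define COLLECT_RESUME() ((void)0)
#define COLLECT_PAUSE()  ((void)0)
#endif

/* Sum 0..n-1 in parallel; only this region is collected. */
long region_of_interest(int n)
{
	long sum = 0;

	COLLECT_RESUME();                /* start collecting just before the region */

	#pragma omp parallel for reduction(+:sum)
	for (int i = 0; i < n; i++)
		sum += i;

	COLLECT_PAUSE();                 /* stop collecting right after it */

	return sum;
}
```

This only narrows the collection as intended if the collector is started in paused mode, as noted above; otherwise everything before the first pause call is collected as well.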

Albert_F_ · 10-20-2014

Thanks for all the replies, but it seems impossible to do this easily with VTune.

In the end, I chose Extrae/Paraver, which uses Dyninst to automatically instrument the entries and exits of a given set of functions. From there it is easy to extract elapsed time and hardware counters.

Thanks again.

McCalpinJohn · 10-20-2014

I was going to recommend compiling with the "profile-functions" option, but looking at the documentation for the Intel 14 compiler I noticed that it says

This option inserts instrumentation calls at a function's entry and exit points within a single-threaded application to collect the cycles spent within the function to produce reports that can help in identifying code hotspots.

I have not tested this to see if it is ignored when compiling for an OpenMP target.