Seunghwa_Kang
Beginner
43 Views

Measuring CPU usage

Hello,

Assuming the following code, I want to measure how much cpu time is spent on executing funcA and funcB, respectively. FuncA and FuncB can invoke (multiple) parallel_for routines.

[cpp]int main() {
	...
	parallel_invoke( []{ funcA(); }, []{ funcB(); } );
	...
}

void funcA( void ) {
	...
	parallel_for( ... ) {
		...
	}
	...
}

void funcB( void ) {
	...
	parallel_for( ... ) {
		...
	}
	...
}[/cpp]

What's the best way to do this (it does not need to be super accurate)? I googled a lot, and what I have found is to call clock_gettime (with CLOCK_THREAD_CPUTIME_ID) or getrusage (with RUSAGE_THREAD) at the beginning and the end of every code block, as in the code below. But this becomes pretty ugly if there are multiple parallel_for routines (and those routines also invoke parallel_for routines in a nested fashion) and these are spread across multiple files.

[cpp]void funcA( void ) {
	timespec start, end;
	time = 0; // CPU time accumulated for funcA, shared by all threads

	// serial part before the loop
	clock_gettime( ..., &start );
	...
	clock_gettime( ..., &end );
	time += diff( end, start );

	parallel_for( ... ) {
		// each loop body measures its own thread's CPU time
		timespec lStart, lEnd;
		clock_gettime( ..., &lStart );
		...
		clock_gettime( ..., &lEnd );
		time += diff( lEnd, lStart );//atomic
	}

	// serial part after the loop
	clock_gettime( ..., &start );
	...
	clock_gettime( ..., &end );
	time += diff( end, start );
}[/cpp]

Is there a better way to do this? Such as invoking some function just at the beginning and the end of funcA and funcB and get the same results...
SKost
Valued Contributor II
43 Views

>>...but this becomes pretty ugly if there are multiple parallel_for routines (and those routines also
>>invoke parallel_for routines in a nested fashion)...

Do you have some simple logging API to a text file? If yes, in case of nested/recursive calls you could use
a tabulation-like approach:

- Let's say you have a nested/recursive function and it calls itself many times;

- A simple call to 'printf( "Executed in %ld\n", uiTime)' doesn't help because all "records" will be aligned at the same column, so the nesting is not visible;

- Create a static variable: 'static uint uiRecursiveLevel = 0';

- Increment 'uiRecursiveLevel' every time the code enters a nested/recursive part;

- Decrement 'uiRecursiveLevel' every time the code leaves a nested/recursive part;

- Use the current value of 'uiRecursiveLevel' to create a substring of tabulation characters '\t'
and use the substring in the 'printf' call (the number of '\t' characters should be equal to the value of
'uiRecursiveLevel');

- A logging output, for example after 3 nested/recursive calls, could look like:

...
Enter 0
	Enter 1
		Enter 2
		Executed in 10 ms
		Exit 2
	Executed in 20 ms
	Exit 1
Executed in 30 ms
Exit 0
...

In that case you could clearly see execution times for different parts of codes!

It could also be done in C++, in which case all logging API calls go in a constructor and a destructor. Both ways are simple to implement.
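A minimal sketch of the constructor/destructor variant, for illustration only. The class name ScopeLog and the function recurse are hypothetical, and the sketch assumes a thread_local depth counter instead of the plain static variable described above, so that each thread keeps its own nesting level. It uses wall-clock time from std::chrono::steady_clock.

```cpp
#include <chrono>
#include <cstdio>
#include <string>

// Hypothetical helper: prints "Enter"/"Exit" records indented by one '\t'
// per nesting level, plus the wall-clock time spent inside the scope.
class ScopeLog {
public:
    ScopeLog()
        : start_(std::chrono::steady_clock::now()), level_(depth()++) {
        std::printf("%sEnter %u\n", indent().c_str(), level_);
    }
    ~ScopeLog() {
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                      std::chrono::steady_clock::now() - start_).count();
        std::printf("%sExecuted in %lld ms\n", indent().c_str(),
                    static_cast<long long>(ms));
        std::printf("%sExit %u\n", indent().c_str(), level_);
        --depth();
    }
    ScopeLog(const ScopeLog&) = delete;
    ScopeLog& operator=(const ScopeLog&) = delete;

    // Nesting depth of the calling thread (assumption: one counter per thread).
    static unsigned& depth() {
        thread_local unsigned d = 0;
        return d;
    }

private:
    std::string indent() const { return std::string(level_, '\t'); }

    std::chrono::steady_clock::time_point start_;
    unsigned level_;
};

// Example of a recursive function instrumented with one line.
void recurse(unsigned remaining) {
    ScopeLog log;
    if (remaining > 0) recurse(remaining - 1);
}
```

Calling recurse(2) produces three nested Enter/Exit pairs. Note that this only visualizes nesting on one thread; it does not by itself sum CPU time across threads.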

Best regards,
Sergey
Seunghwa_Kang
Beginner
43 Views

Thanks for the reply, but umm... I may have failed to explain my problem clearly enough.

My understanding is that if I execute parallel_invoke( funcA, funcB ), these two routines will run in parallel (assuming that there are enough available worker threads). If funcA and funcB invoke parallel_for multiple times, the number of threads assigned to execute funcA and funcB can change dynamically (via work stealing). FuncA can be latency sensitive (MPI message passing) and FuncB can be compute intensive, and the intention is to hide the latency of FuncA by running FuncB while FuncA is waiting for MPI communication to finish.

What I want to know is how much CPU load funcA and funcB each represent, that is, how long each would take if funcA were executed first (excluding MPI message waiting time) and funcB were executed after funcA finished, each using the entire set of worker threads. This information is later used for load balancing at the MPI level. The relative amount of computing in funcA and funcB varies widely with the input data, several user settings, and even within a single run as the simulation advances, so I have to do this at run time.
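As a rough illustration of "excluding waiting time": the per-thread CPU clock (CLOCK_THREAD_CPUTIME_ID, POSIX) only advances while the thread is actually running, so time spent asleep is not counted. The sketch below uses hypothetical helper names (thread_cpu_seconds, time_call, fake_blocking_wait) and a sleep as a stand-in for a blocking wait. That is an assumption: whether a real MPI wait sleeps or busy-polls depends on the MPI implementation, and a polling wait would show up as CPU time.

```cpp
#include <chrono>
#include <thread>
#include <time.h>   // clock_gettime, CLOCK_THREAD_CPUTIME_ID (POSIX)

// CPU time consumed so far by the calling thread, in seconds.
double thread_cpu_seconds() {
    timespec ts{};
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
    return static_cast<double>(ts.tv_sec) + ts.tv_nsec * 1e-9;
}

struct Timing {
    double wall;  // elapsed wall-clock seconds
    double cpu;   // CPU seconds used by the calling thread
};

// Runs f on the calling thread and reports both clocks.
template <typename F>
Timing time_call(F&& f) {
    auto w0 = std::chrono::steady_clock::now();
    double c0 = thread_cpu_seconds();
    f();
    double c1 = thread_cpu_seconds();
    auto w1 = std::chrono::steady_clock::now();
    return { std::chrono::duration<double>(w1 - w0).count(), c1 - c0 };
}

// Stand-in for a blocking wait (assumption: the wait sleeps, it does not spin).
void fake_blocking_wait() {
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
}
```

For time_call(fake_blocking_wait), the wall field is at least about 50 ms while the cpu field stays far smaller, which is the distinction the load estimate relies on.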

The problem is that, as FuncA and FuncB are executed in parallel and the number of threads assigned to each function changes dynamically, I need to sum the CPU usage of all the threads assigned to each function while those threads are executing code for that function.

If these are not threads but processes, there is an API to compute the aggregate CPU usage by a process and all its child processes.

I wanted to find an API that can compute the execution time of a thread and all its "child" threads, but I haven't found one yet. And since TBB does not spawn threads every time it encounters a parallel_for but instead takes threads from a worker thread pool, even if such an API exists, I may not be able to use it without some modification.

I wonder if there is an elegant way to do this. A function that measures the combined CPU usage of a certain region (something like the API that measures the CPU usage of all child processes, but working for threads instead of processes, and with TBB's parallelization mechanism) would be best. If not, I may need to do something I described as "ugly" :-( (it increases code size and makes the code hard to understand and modify).
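One way to make the "ugly" variant somewhat less intrusive is a scoped timer that adds the calling thread's CPU time to a per-function atomic counter when it goes out of scope. This is only a sketch under assumptions: CpuCounter, ScopedCpuTimer, thread_cpu_ns and busy_chunk are hypothetical names, the clock is the POSIX CLOCK_THREAD_CPUTIME_ID, and busy_chunk stands in for the body of one parallel_for chunk. The timer would have to sit in leaf code only: if the timed scope itself calls a nested parallel_for, the waiting thread may execute stolen tasks from the other function, and that time would be charged to the wrong counter.

```cpp
#include <atomic>
#include <cstdint>
#include <time.h>   // clock_gettime, CLOCK_THREAD_CPUTIME_ID (POSIX)

// One counter per function to be measured, e.g. one for funcA, one for funcB.
struct CpuCounter {
    std::atomic<std::int64_t> ns{0};
    double seconds() const { return ns.load() * 1e-9; }
};

// CPU time consumed so far by the calling thread, in nanoseconds.
inline std::int64_t thread_cpu_ns() {
    timespec ts{};
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
    return static_cast<std::int64_t>(ts.tv_sec) * 1000000000LL + ts.tv_nsec;
}

// Adds the CPU time spent in the enclosing scope to the given counter.
class ScopedCpuTimer {
public:
    explicit ScopedCpuTimer(CpuCounter& c)
        : counter_(c), start_(thread_cpu_ns()) {}
    ~ScopedCpuTimer() {
        counter_.ns.fetch_add(thread_cpu_ns() - start_,
                              std::memory_order_relaxed);
    }
    ScopedCpuTimer(const ScopedCpuTimer&) = delete;
    ScopedCpuTimer& operator=(const ScopedCpuTimer&) = delete;

private:
    CpuCounter& counter_;
    std::int64_t start_;
};

// Stand-in for the body of one loop chunk: one extra line at the top.
double busy_chunk(CpuCounter& counter, int n) {
    ScopedCpuTimer timer(counter);
    volatile double x = 0.0;   // volatile keeps the loop from being optimized away
    for (int i = 0; i < n; ++i) x = x + i * 0.5;
    return x;
}
```

After the parallel region finishes, counter.seconds() gives the summed CPU time of all timed scopes, regardless of which worker threads ran them.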

Thanks!!!
RafSchietekat
Black Belt
43 Views

"FuncA can be (MPI message passing) latency sensitive and FuncB can be compute intensive, and the intention is to hide latency of FuncA by running FuncB when FuncA is waiting for MPI communication to be finished."
Be warned that blocking in FuncA will cause the threads to be undersubscribed.

I'm not aware of any profiling support that accumulates the useful time spent in a function and takes into account work performed by other threads and excludes stolen work performed for unrelated functions.

I'm afraid you should give up on the idea of calling MPI from TBB, or of going into much more detail than basically just wall time.
Seunghwa_Kang
Beginner
43 Views

Thanks for the reply.

I use wall clock time to measure load imbalance, and the CPU usage estimate is used to decide which block to move from one node to another, so what I need is just a rough estimate, but it seems there is no simple way to do this.

And could you explain "blocking in FuncA will cause the threads to be undersubscribed" a bit more?

FuncA works like this:

FuncA {
	for( multiple iterations ) {
		do some computing with parallel_for
		do MPI communication
	}
}

Are you saying FuncA will be descheduled when it blocks on an MPI message and will not be scheduled again, and so will starve (even when the MPI message arrives) until FuncB finishes (even if the thread group for FuncA has a higher priority than the group for FuncB)? Or do you mean something else?

Thank you very much!!!
RafSchietekat
Black Belt
43 Views

The task scheduler is not aware of any thread being blocked. The thread will be scheduled again by the operating system when it is unblocked, but until that time one of the hardware threads (a core or one of its hyperthreads) will not participate in the processing, even if there is work ready to continue further down in the stack of the blocked software thread, potentially exacerbating the situation.

You're better off only invoking TBB from MPI and never the other way around.
Seunghwa_Kang
Beginner
43 Views

Got it, and thanks!!! I think that sort of blocking is fine, as I am wasting only one hardware thread's CPU cycles (my systems have 24-64 threads per node, and unless everything is perfectly balanced and synchronized, it is impossible to completely avoid wasting some CPU cycles).