I'm profiling different variants of a network benchmark (NetPIPE) while running it in loopback. One version is single-threaded and uses standard blocking sockets. The other versions are multi-threaded (4 threads) and use non-blocking sockets. The difference in performance between the threaded and unthreaded versions is pretty big, which was to be expected, but I'm having trouble accounting for all the time in all cases.
One of the multi-threaded versions shows a 3x latency difference over non-threaded. This version uses pthread_cond_timedwait() to coordinate between threads. However, when I complete a run using sampling, both this and the non-threaded version report very close to the same number of Clockticks. I would have expected that the Clockticks in the Event Summary for the process view would have been 3x bigger with the extra time going to the Idle Pid if nothing else. The other multi-threaded versions that use a polling method instead of pthread_cond_timedwait() do come up with the expected larger Clockticks. Assuming the benchmark timing mechanism (which is the same in all versions) is accurate, can you speculate as to why VTune would report virtually the same number of Clockticks when the run takes 3x longer when using pthread_cond_timedwait()?
P.S. This is being run on RedHat 7.3 (2.4.18 SMP kernel) with VTune 2.0
With a bit of a red face, I should report that I now understand the discrepancy. I had assumed that the setup time around the iteration loops of NetPIPE was trivial. It turns out this is not true: with the default iteration counts, this overhead swamps the actual benchmark part of the test, and thus total execution time was fairly constant. Once I increased the iteration count from the command line, all times and Clockticks were in the expected proportions.
A little embarrassed,
Roy, no need for embarrassment here, EVER! Thanks for the follow-up; these discussions are the bread and butter of what we do on the VTune team.
I was ready to post that the blocked time does not get attributed to the blocked process in Linux, it being multiuser and multitasking and always pretty busy.
But here's a composite answer that a few developers came up with for your question, in case OTHER readers on this thread are curious for more:
For sampling, when pthread_cond_timedwait is called and the thread blocks, if one or more threads in other processes are waiting to execute, the time that the original thread spends blocked will be attributed to those other threads, not to the blocked one, because those other threads are where the CPU is actually executing code. We think the only condition that really needs to hold for this to happen is that there is another process waiting to execute. It seems to us that if there isn't, a process switch might not occur, and the intervening idle time might still be attributed to the blocked process.
Message Edited by jdgallag on 02-26-2004 11:58 AM