Is there any tool or code to determine if the processors are being utilized by the threads?
I have this parallel code that is slower than its serial implementation. I just want to show that the parallel version really is running in parallel, even though it is slower than the serial one.
The question is whether the processors are being used, misused, or abused.
To inspect and confirm this, and to discover misuse or abuse, insert a breakpoint in your parallel region. Most multi-threading debuggers have a "Threads" window where you can see the list of threads created by the application. They also have a means to suspend or resume a particular thread or group of threads. In MS WinDbg there is a Threads pane, and if you right-click on a thread a pop-up lets you select "Freeze" or "Thaw" to suspend or resume that thread. You can use this to trace execution through your parallel region one thread at a time. Note that you may discover that all the threads are performing the same work, as opposed to each performing a piece of the work.
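If a debugger is not convenient, the same check can be done in code by having each thread log which part of the work it handled. The following is only a minimal sketch using C++ `std::thread` rather than whatever threading model your program uses; `WorkRecord`, `run_partitioned` and `ranges_are_disjoint_and_complete` are hypothetical names made up for illustration, and the increment stands in for your real per-element work.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// One record per worker: which thread ran, and which half-open
// index range [begin, end) it handled.
struct WorkRecord {
    std::thread::id tid;
    std::size_t begin;
    std::size_t end;
};

// Hypothetical helper: splits [0, n) into contiguous chunks, one per
// thread, and has each thread log its own id and range.
std::vector<WorkRecord> run_partitioned(std::size_t n, unsigned num_threads,
                                        std::vector<int>& touched) {
    assert(num_threads > 0);
    std::vector<WorkRecord> records(num_threads);
    std::vector<std::thread> workers;
    touched.assign(n, 0);
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = n * t / num_threads;
        const std::size_t end = n * (t + 1) / num_threads;
        workers.emplace_back([&records, &touched, t, begin, end] {
            for (std::size_t i = begin; i < end; ++i) {
                ++touched[i];  // stands in for the real per-element work
            }
            records[t] = {std::this_thread::get_id(), begin, end};
        });
    }
    for (auto& w : workers) w.join();
    return records;
}

// True when the recorded ranges tile [0, n) with no gaps and no overlap,
// i.e. the threads did pieces of the work rather than all of it each.
bool ranges_are_disjoint_and_complete(const std::vector<WorkRecord>& records,
                                      std::size_t n) {
    std::size_t expected = 0;
    for (const auto& r : records) {
        if (r.begin != expected) return false;
        expected = r.end;
    }
    return expected == n;
}
```

Calling `run_partitioned(1000, 4, touched)` and printing the records shows one range per thread. If every element of `touched` ends up as 1, each index was processed exactly once. A count equal to the number of threads would point to redundant work.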
Typically, when code runs slower in parallel, the usual reasons are:
a) doing redundant work
b) adverse cache interaction between threads
c) the process is memory bound, where even a single thread is capable of stalling on memory access
d) programmer expectation that thread scheduling has zero overhead.
Situation a) is a programming error: every thread does the whole job instead of its own piece of it.
Situation b) can be improved by reworking the algorithm so that the writes by each thread are segregated into separate cache lines. Data alignment can improve this further. Assume a cache line is 64 bytes (though this may soon double with newer processors).
Situation c) typically cannot be resolved by adding threads, but you can use it to your advantage if you can write the code such that the compute-intensive portion of one thread coincides with the memory-intensive portion of another thread.
Situation d) is a matter of knowing what the overhead is and factoring that into your decision as to whether, and how, to parallelize the code.
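For situation b), here is a rough sketch of what keeping each thread's writes in a separate cache line can look like in C++17. The 64-byte line size is the assumption mentioned above, so check it for your processor. `kCacheLine`, `PaddedCounter` and `count_per_thread` are hypothetical names, and the increment loop is only a stand-in for a real workload.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Assumed cache line size in bytes; verify for your target processor.
constexpr std::size_t kCacheLine = 64;

// alignas pads and aligns each counter to a full cache line, so a write
// by one thread does not invalidate the line holding another thread's
// counter. (std::vector of an over-aligned type relies on C++17.)
struct alignas(kCacheLine) PaddedCounter {
    std::uint64_t value = 0;
};

static_assert(sizeof(PaddedCounter) == kCacheLine,
              "each counter should fill exactly one cache line");

// Hypothetical workload: each thread increments only its own counter.
std::vector<PaddedCounter> count_per_thread(unsigned num_threads,
                                            std::uint64_t iterations) {
    assert(num_threads > 0);
    std::vector<PaddedCounter> counters(num_threads);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counters, t, iterations] {
            for (std::uint64_t i = 0; i < iterations; ++i) {
                counters[t].value += 1;  // touches only this thread's line
            }
        });
    }
    for (auto& w : workers) w.join();
    return counters;
}
```

Without the `alignas`, eight such counters would sit in a single 64-byte line, and every increment by one thread could force the other threads to re-fetch that line. To see the difference on your machine, time both layouts with a workload the compiler cannot fold into a single addition.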
Regardless of the parallelization issue, first look at your code to see if it can take advantage of vectorization, and then see if your attempt at parallelization affects the vectorization. If it does, then see if you can rework the parallelization such that it does not affect the vectorization (e.g. run a loop in increments of 2, 4 or more, and/or use temp variables to improve vectorization).
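As an illustration of that last suggestion, the sketch below gives each thread one contiguous block and runs the inner loop in increments of 4 with separate temp variables. It is an assumption-laden example in C++ with `std::thread`, not your code; `block_sum` and `parallel_sum` are hypothetical names. Whether the compiler actually vectorizes the loop depends on the compiler and flags, so check its vectorization report.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Sums data[begin, end) in steps of 4 using four independent temp
// accumulators, which removes the single-accumulator dependency chain
// that can block vectorization.
double block_sum(const double* data, std::size_t begin, std::size_t end) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    std::size_t i = begin;
    for (; i + 4 <= end; i += 4) {
        s0 += data[i];
        s1 += data[i + 1];
        s2 += data[i + 2];
        s3 += data[i + 3];
    }
    double tail = 0.0;
    for (; i < end; ++i) tail += data[i];  // leftover 0 to 3 elements
    return (s0 + s1) + (s2 + s3) + tail;
}

// Each thread gets one contiguous block, so the inner loop stays a plain
// unit-stride loop regardless of the thread count.
double parallel_sum(const std::vector<double>& data, unsigned num_threads) {
    assert(num_threads > 0);
    std::vector<double> partial(num_threads, 0.0);
    std::vector<std::thread> workers;
    const std::size_t n = data.size();
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = n * t / num_threads;
        const std::size_t end = n * (t + 1) / num_threads;
        workers.emplace_back([&data, &partial, t, begin, end] {
            // One write per thread at the very end, not one per element.
            partial[t] = block_sum(data.data(), begin, end);
        });
    }
    for (auto& w : workers) w.join();
    double total = 0.0;
    for (double p : partial) total += p;
    return total;
}
```

The design point is that the work is split between threads at the block level only. Handing out interleaved elements (thread 0 takes index 0, 4, 8, ...) would turn the inner loop into a strided access pattern, which is typically harder to vectorize.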