mohanmuthu · ‎10-12-2013

I used the /Qparallel compiler option and observed some parallel execution. My code is intended to process a lot of data. Interestingly, when I use a small data size, I see a higher degree of parallelism (judging by CPU usage) than when I use a large amount of data. My understanding is that the /Qparallel option parallelizes the code only at compile time, not during execution. If that is wrong, please let me know whether there is any dynamic memory limit that decides the extent of parallelization.

Also, can someone help me with coding methods that enable a high degree of parallelism with /Qparallel?

TimP · ‎10-12-2013

An obvious suspicion would be that you have significant time spent in a portion of the program which didn't parallelize (e.g. according to /Qpar-report). You might need to use a profiler such as VTune or CodeAnalyst, or use system_clock to time individual sections.
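As a rough illustration of timing individual sections: the sketch below is in C rather than Fortran, so it uses C11 `timespec_get` in place of `system_clock`. The phase functions and the `SectionTimes` struct are hypothetical stand-ins for whatever sections the real program has.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Elapsed wall-clock seconds per section (hypothetical layout). */
typedef struct {
    double update_s;
    double sum_s;
} SectionTimes;

/* Wall-clock time in seconds via C11 timespec_get.
   Note: TIME_UTC is not guaranteed monotonic. */
static double wall_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Hypothetical section 1: independent per-element update. */
static void update_phase(double *x, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        x[i] = 2.0 * x[i] + 1.0;
}

/* Hypothetical section 2: sum over all elements. */
static double sum_phase(const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += x[i];
    return s;
}

/* Run both sections, recording how long each one took. */
static double run_timed(double *x, size_t n, SectionTimes *t)
{
    assert(x != NULL && t != NULL);

    double t0 = wall_seconds();
    update_phase(x, n);
    double t1 = wall_seconds();
    double s = sum_phase(x, n);
    double t2 = wall_seconds();

    t->update_s = t1 - t0;
    t->sum_s = t2 - t1;
    return s;
}
```

Comparing the per-section times for a small and a large data set should show which section stops scaling. Wall-clock time is used deliberately: CPU time summed over threads would hide a section that runs on one core.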

If you have a parallel section which exhibits work imbalance (where some threads take significantly more than the average time), that could also produce this symptom. There you could use omp schedule(runtime) and try the various options, such as OMP_SCHEDULE=guided|auto|dynamic,2.

If you have a dual-CPU or HyperThreaded platform, setting KMP_AFFINITY appropriately could have a useful effect.
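A minimal sketch of the schedule(runtime) idea in C, assuming a deliberately imbalanced loop where row i costs roughly i units of work. The function name `triangular_row_sums` is hypothetical. Built without OpenMP, the pragma is ignored and the loop runs serially with the same result.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical imbalanced kernel: out[i] = a[0] + ... + a[i],
   so later iterations do more work than earlier ones. */
static void triangular_row_sums(const double *a, double *out, int n)
{
    assert(n >= 0);

    /* schedule(runtime) defers the scheduling policy to the
       OMP_SCHEDULE environment variable, so it can be changed
       without recompiling. */
    #pragma omp parallel for schedule(runtime)
    for (int i = 0; i < n; ++i) {
        double s = 0.0;
        for (int j = 0; j <= i; ++j)
            s += a[j];
        out[i] = s;
    }
}
```

With OpenMP enabled (/Qopenmp for the Intel compiler on Windows, -fopenmp for GCC), one can set OMP_SCHEDULE to, say, `guided` or `dynamic,2` before each run and compare timings. With a default static schedule, the thread that gets the last rows would be expected to finish well after the others.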

I doubt much can be said about coding methods specific to /Qparallel, beyond considering what you would need to do to make efficient use of OpenMP and looking up the directives. The usual methods apply: set up multiple levels of loops so that the inner loop vectorizes and the threads can work on mostly independent data regions, and ensure that the loop which first accesses the data is parallelized consistently with the others (even if that loop is not time-consuming by itself). It quickly gets to the point where you can progress faster with OpenMP, unless your code closely resembles the patterns which /Qparallel must recognize in order to get good SPECfp scores.
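The loop structure described above might look like the following C sketch. The names `init_grid` and `scale_grid` are hypothetical, and whether an auto-parallelizer actually threads these loops depends on the compiler and its cost model, so this only shows the shape to aim for. The outer loop walks independent rows, the inner loop is unit stride, and the initialization loop has the same outer structure as the compute loop, so the first access to the data is partitioned the same way as later accesses. In Fortran the index roles are reversed, because arrays are column-major.

```c
#include <assert.h>
#include <stddef.h>

/* First access to the data: same outer-loop partitioning as the
   compute loop below, even though this loop is cheap by itself. */
static void init_grid(double *restrict a, double *restrict b,
                      int rows, int cols)
{
    assert(rows >= 0 && cols >= 0);
    for (int i = 0; i < rows; ++i) {
        for (int j = 0; j < cols; ++j) {
            a[(size_t)i * cols + j] = (double)(i + j);
            b[(size_t)i * cols + j] = 0.0;
        }
    }
}

/* Compute loop: rows are independent (candidates for threading),
   the inner loop is unit stride (a candidate for vectorization).
   restrict tells the compiler that a and b do not overlap. */
static void scale_grid(const double *restrict a, double *restrict b,
                       int rows, int cols, double factor)
{
    assert(rows >= 0 && cols >= 0);
    for (int i = 0; i < rows; ++i) {
        for (int j = 0; j < cols; ++j)
            b[(size_t)i * cols + j] = factor * a[(size_t)i * cols + j];
    }
}
```

The compiler's parallelization report (/Qpar-report, mentioned earlier in the thread) is the way to check which of these loops were actually parallelized.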

Bernard · ‎10-12-2013

You can use VTune's front-end and back-end pipeline stall metrics to find out why performance decreases as the data set grows larger.

How do you measure your application's overall CPU load?

Bernard · ‎10-14-2013

This is a follow-up. In your case, measuring front-end pipeline stalls is recommended over collecting back-end stall metrics.

mohanmuthu · ‎10-14-2013

Thank you Tim, iliyapolak. Let me try your suggestions.

* I use Resource Monitor (opened from Task Manager) to see the CPU load, but I have no quantitative measurement.

Bernard · ‎10-14-2013

Do you mean that you cannot see the per-core load distribution?

Bernard · ‎10-14-2013

I usually start investigating CPU performance by using Process Explorer to measure CPU load in CPU clock cycles charged to a specific process. Task Manager uses the interrupt clock to sample the CPU load and IP, so it is not as accurate as PE.

How does /Qparallel work?