Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

The power of the Core i7 and Parallel Programming

Alexander_Agathos
1,434 Views
Just a demo of the power of a Core i7-920 for parallel programming. With some more code optimization, the 6 seconds can surely become 3 seconds:

http://users.iit.demokritos.gr/~agalex/ParallelTime.html

Cheers,
Alexander Agathos.
18 Replies
mahmoudgalal1985
Beginner
Quoting - agalex

Thank you...
Alexander_Agathos
Quoting - mahm85

Thank you...

There is also the option of resorting to GPGPU (CUDA), but CUDA has some limitations; it still can't beat the tremendous freedom the CPU gives you. For instance, in my project I had to feed the processors with N data points (3D vertices), where N can reach even a million. That scale points you toward CUDA, but there is a way to reduce the complexity to M representative points, where M reaches at most 300. That leads you directly back to the CPU and all the freedom it can give you. For example, with 150,000 points I was able to acquire the basic parts of the Stanford Asian Dragon:

http://www.iit.demokritos.gr/~agalex/StanfordDragon.jpg

All the work was done entirely on the CPU in about 50 seconds. Without parallel programming it would take 5-6 times longer.

Regards,
Alexander.
Alexander_Agathos

A few words on how to go parallel. The basic question is how many threads to use, and it is not an easy one. If you give the processor just 8 threads, there is a strong chance that some of them will finish earlier, leaving some of the CPU's power unused for a while. So I would suggest giving the processor more threads than it can run in parallel at the same time, say 16. In my case I would say that is a safe number; it decreases the probability of a core going idle.

In my application I have a 3D mesh, which represents a 3D object. The mesh consists of N 3D vertices lying on the surface of the model and a set of triangles connecting those N vertices.

In my project the computationally intense part is the protrusion function. For a vertex v of the mesh it is defined as pf(v) = sum_{i=1..M} g(v, b_i) * Area(b_i), where b_i is a representative point (or base point) of the mesh and Area(b_i) is the area of the mesh associated with b_i, that is, the sum of the areas of the triangles enclosed in that region. g(v, b_i) is the geodesic distance between v and b_i. A geodesic distance between two points is a distance like the Euclidean distance, except that it is defined as the length of the shortest curve on the mesh connecting the two points. In discrete form this length can be quickly approximated by the shortest path connecting the two points while traveling along the edges of the triangles (Dijkstra's algorithm).
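The Dijkstra step can be sketched as follows. This is a generic single-source shortest-path routine over the mesh's edge graph, not the project's actual code; the adjacency representation and the function name are illustrative:

```cpp
#include <vector>
#include <queue>
#include <utility>
#include <limits>

// Approximate geodesic distances from vertex `src` to every other vertex,
// traveling along mesh edges (Dijkstra's algorithm on the edge graph).
// adj[v] holds (neighbor, edge length) pairs.
std::vector<double> geodesicFrom(
    const std::vector<std::vector<std::pair<int, double> > >& adj, int src)
{
    const double INF = std::numeric_limits<double>::infinity();
    std::vector<double> dist(adj.size(), INF);
    typedef std::pair<double, int> Node;            // (distance, vertex)
    std::priority_queue<Node, std::vector<Node>, std::greater<Node> > pq;
    dist[src] = 0.0;
    pq.push(std::make_pair(0.0, src));
    while (!pq.empty()) {
        double d = pq.top().first;
        int v = pq.top().second;
        pq.pop();
        if (d > dist[v]) continue;                  // stale queue entry
        for (std::size_t k = 0; k < adj[v].size(); ++k) {
            int u = adj[v][k].first;
            double w = adj[v][k].second;
            if (dist[v] + w < dist[u]) {
                dist[u] = dist[v] + w;
                pq.push(std::make_pair(dist[u], u));
            }
        }
    }
    return dist;
}
```

With the distances in hand, pf(v) is just the accumulated sum of g(v, b_i) * Area(b_i) over the base points.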

First parallel function, finding the M representative salient points: the N points are divided into TT groups of size ceil(N/TT), where TT is the total number of threads. There is no need to worry; the actual number of points each thread handles is far less than ceil(N/TT). In this way the b_i and Area(b_i) are acquired.

Second parallel function, computing the protrusion function: here one has to find the geodesic distances from each of the representative points b_i to all the vertices of the mesh. So the M representatives are divided into TT groups of size ceil(M/TT). Each thread computes, for each of its base points, the geodesic distances and adds them into a sum accumulator with a DAXPY-style update (Y = aX + Y). This can be done very quickly with a SIMD intrinsic; I did it while prefetching memory into the processor's cache (to tell you the truth, the L1 cache, since it is very fast).
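The accumulation step can be sketched as a plain DAXPY loop; the scalar loop below stands in for the hand-written SIMD intrinsic and prefetching described above, since a vectorizing compiler will turn it into SIMD on its own (function and variable names are illustrative):

```cpp
#include <vector>
#include <cstddef>

// Y = aX + Y: add the weighted geodesic-distance vector of one base point
// into the running protrusion-function accumulator. A vectorizing compiler
// emits SIMD instructions for this loop automatically.
void daxpyAccumulate(std::vector<double>& y, double a,
                     const std::vector<double>& x)
{
    const std::size_t n = y.size();
    for (std::size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```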

So this is what I did, in a nutshell. I did some more things, but with the same kind of parallelism in mind. For the scientific details of the algorithm, see the relevant bibliography.

Regards,
Alexander Agathos.
jimdempseyatthecove
Honored Contributor III

Alexander,

>>If you give the Processor just 8 threads then there is a strong chance that some of the threads will finish earlier leaving some of the CPU power not used for some time. So I would suggest to give the processor more threads than it can handle in Parallel at the same time. Lets say 16. I would say it is a safe number in my case, this decreases the probability of a Processor to go idle.

That may be one solution, but it is not optimal, due to unnecessary context switching by the OS and the ineffective cache use that comes from managing 16 threads, each with a separate stack, as opposed to 8 in the scenario mentioned.

When programming in OpenMP you can do essentially the same thing as over-subscription, but with better performance, by using schedule(static, chunkSize), where chunkSize is the iteration space size / 16 in this example. The iteration space is then divided into the same number of slices as your over-subscription had, except that now you have eliminated the context-switch overhead and cut the cache requirement for stack variables in half. You still have the problem that slices are bound to specific threads: should one of your threads be preempted, it will finish late, potentially blocking the other threads in your app for the duration of the completion skew of the preempted thread(s). This occurs at the barrier at the end of the parallel loop.

You can do better than schedule(static, chunkSize) by using schedule(dynamic, chunkSize), where chunkSize is the iteration space size / 16 or larger in this example. This method has the same advantages as schedule(static, chunkSize), plus one more: when preemption causes a skew longer than the completion time of a slice on another thread, the other thread(s) can take over the slice that, under the static schedule, would have been destined to run on the preempted thread. Additionally, by using a smaller chunk size you can reduce the barrier block time (due to skew) somewhat, though this has diminishing returns.
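As an illustrative sketch (the loop body and function name are stand-ins, not Alexander's actual per-vertex job), the dynamic schedule described above looks like this:

```cpp
#include <vector>

// Sum a per-vertex quantity with a dynamic schedule so that a preempted
// thread cannot strand one large fixed slice. chunkSize is the iteration
// space / 16, as discussed above. Without -fopenmp the pragma is ignored
// and the loop simply runs serially with the same result.
double processAll(const std::vector<double>& work)
{
    const int n = static_cast<int>(work.size());
    const int chunk = n / 16 > 0 ? n / 16 : 1;
    double total = 0.0;
#pragma omp parallel for schedule(dynamic, chunk) reduction(+:total)
    for (int i = 0; i < n; ++i)
        total += work[i] * work[i];   // stand-in for the real per-vertex job
    return total;
}
```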

Scheduling (either static or dynamic) does come with a cost, but this cost will almost always be less than the cost of over-subscribing to the equivalent number of threads.

The other type of parallel programming is a task-based paradigm: both Threading Building Blocks (TBB) and QuickThread provide tasking capabilities. Both are immune to thread over-subscription, with QuickThread having better asynchronous capabilities than TBB. Note that although OpenMP 3.0 adds tasking (and Intel's compiler offers the taskq extension), that task structure is neither fully asynchronous nor immune to thread over-subscription.

Jim Dempsey
Alexander_Agathos

Quoting - jimdempseyatthecove

Thanks for the advice. OpenMP is a great solution. The problem is that I want to stay independent of libraries that not many people can maintain, since the project I am involved in needs to be completely cross-platform with minimal use of libraries. Maybe I can construct a simple scheduler that feeds threads to the CPU, working with only 8 threads, or as many as the processor can handle. I will look into this with the options the threading library I am using provides. Thank you again for your kind advice; it was very helpful.

Regards,
Alexander Agathos.
Alexander_Agathos
Quoting - agalex

As I have seen, it is very, very easy. I am now using Qt as the only library that has to be maintained. It supports threads and has some nice signals and booleans for termination. So a queue of threads can be constructed, and the processor can be fed whenever a thread has finished its execution. It is extremely simple. Thank you for your advice on the thread count.

Or just an array of threads is OK. It goes like this:

NP = number of processors
NT = number of threads
TA = array of pointers to the threads (size NT)
TAconc = array of pointers to the threads that can run concurrently (size NP)

for (i = 0; i < NP; i++) TAconc[i] = TA[i]
for (i = 0; i < NP; i++) start(TAconc[i])
initialize a thread counter TC to NP
while (TC < NT || at least one thread is running)
    for (i = 0; i < NP; i++)
        if (finished(TAconc[i]))
            if (TC < NT) {
                delete(TAconc[i]);
                TAconc[i] = TA[TC++];
                start(TAconc[i]);
            }
for (i = 0; i < NP; i++) delete(TAconc[i])

I think it looks safe and everything finishes smoothly.
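A portable C++11 rendering of the same scheme, using an atomic job counter instead of polling finished(); the names and the use of std::thread here are illustrative, not Qt's API:

```cpp
#include <atomic>
#include <thread>
#include <vector>
#include <functional>

// Run NT jobs on NP worker threads. Each worker claims the next unstarted
// job via an atomic counter, so no core sits idle while work remains and
// no polling loop is needed.
void runJobs(std::vector<std::function<void()> >& jobs, unsigned np)
{
    std::atomic<std::size_t> next(0);
    std::vector<std::thread> workers;
    for (unsigned p = 0; p < np; ++p)
        workers.push_back(std::thread([&jobs, &next] {
            for (std::size_t i = next++; i < jobs.size(); i = next++)
                jobs[i]();                    // run one "thread object"
        }));
    for (std::size_t p = 0; p < workers.size(); ++p)
        workers[p].join();                    // barrier: everything finished
}
```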

Regards,
Alexander Agathos.

michele-delsol
Beginner

Quoting - jimdempseyatthecove

Jim,

Thanks for your excellent post. I'd just like to add my 2 cents worth.

When I look at OpenMP I am struck by the ease with which it can be implemented. The drawbacks, of course, are a lack of flexibility, all sorts of hidden pitfalls, and a direct link between parallelization and threads, which can lead to considerable overhead. The programmer must decide how to manage the number of threads, which raises the problem of under/over-subscription and load balancing. If the application is CPU-intensive, does not have significant locks or I/O, and the programming effort is warranted, then TBB could be a considerably better choice. TBB's recursive task decomposition, which fills a reservoir of tasks that the scheduler allocates to threads as they become free to do more work, could indeed lead to much better performance than OpenMP (C++ is assumed, of course).

One more thing: OpenMP and TBB are not exclusive; it is entirely feasible to mix the two in the same application, using OpenMP for the straightforward, simple stuff and TBB for the more complex, CPU-intensive stuff. And there are lots of real-life situations where OpenMP cannot be used to parallelize, whereas it would be feasible with TBB.

Michèle
TimP
Honored Contributor III

When programming in OpenMP you can do better than schedule(static, chunkSize) by using schedule(dynamic, chunkSize)
Core i7, where all threads share a single last-level cache, gives you a lot of flexibility for dynamic scheduling. Unfortunately, multiple-package (socket) non-uniform memory access (NUMA) platforms are more critical in this respect. schedule(guided[,chunk]) compromises by setting static scheduling for the first 50% of the iterations of a parallel do|for and dynamic scheduling with progressively decreasing chunks for the remaining iterations. Still, my experience on the new dual 6-core platforms shows no gain beyond 8 threads under guided or dynamic scheduling.
In the common situation where the number of inner-loop iterations for each outer-loop iteration is known in advance (e.g. an operation on a triangular matrix), it has been shown worthwhile to divide the total work evenly among chunks whose sizes are determined in advance (the same result you would be aiming for with omp collapse and the default static schedule, when that option is valid). Then it is possible to have each thread work on one contiguous block of memory, optimizing cache usage. Unfortunately, although there must be published references, I haven't found them, and it doesn't seem to correspond to a simple programming idiom. On NUMA platforms, good performance still depends on consistent parallel first-touch data initialization to minimize non-local memory references.
As for mixing OpenMP and TBB, I haven't seen a released method which allows their parallel regions to overlap. TBB would have to be used outside OpenMP or auto-parallel parallel regions. This is a recognized opportunity; possibly, a future enhancement of TBB might allow it to work in nested OpenMP parallel mode. At present, the same limitation appears to apply to Ct, as it employs TBB.
An important criterion for judging success of balancing work among threads by scheduling is minimizing barrier time associated with imbalance. The statistics presented by libiomp5_prof library are useful in this respect.
jimdempseyatthecove
Honored Contributor III

Michèle,

Although you could mix OpenMP and TBB (and QuickThread, for that matter), I would suggest doing so only as an interim measure while converting the program from one paradigm to the other, for reasons similar to the over-subscription arguments in my earlier response. If you have relatively long runs in each programming paradigm (on the order of seconds), then you might not suffer too much.

Jim Dempsey
project12kilo
Beginner
I like this forum and I like to recommend using Ubuntu Desktop 9.10 for all your computing needs. I am volunteering to promote Ubuntu 9.10 Desktop. Thanks for your attention.

Alexander_Agathos

Thank you.

TimP
Honored Contributor III
An example of pre-calculating nearly equal numbers of operations for each thread, in a triangular matrix operation
(netlib vectors, s141, f2c translation):

[cpp]      i__2 = *n;
#pragma omp parallel
      {
      float avgchunk = (i__2 + 1) * i__2 / 2;
      int nt;

#if defined _OPENMP
          nt = omp_get_num_threads();
// fair number of array elements per thread
          avgchunk = avgchunk / nt;
#else
          nt = 1;
#endif

#pragma omp for
      for (int m = 1; m <= nt; ++m) {
          int jbeg = sqrtf(.25f + 2 * (m - 1) * avgchunk) + 1;
// closest approximation to targeted chunk size
          int jend = sqrtf(.25f + 2 * m * avgchunk);
// j must be loop-local: a shared j would race between threads
          for (int j = jbeg; j <= jend; ++j) {
              int k = j * (j - 1) / 2;
              for (int i__ = 1; i__ <= j; ++i__)
                  cdata_1.array[i__ + k - 1] += bb[i__ + j * bb_dim1];
          }
      }
      }[/cpp]


This compiles efficiently, with the inner loop vectorized and the outer loop threaded, under several compilers, including gcc, icc, Sun C99, and presumably others. Core i7 performs well without going to all this effort, but the time spent on the sqrt is easily recovered when the problem is large enough to gain from combined threading and vectorization.
ninhngt
Beginner
Quoting - agalex


OpenMP is not a library; almost all modern compilers support it. Just having to turn on a compiler switch is not too hard. It saves you a lot of manual thread coding and gives better performance.
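For instance, parallelizing a reduction takes a single pragma; compiled without the OpenMP switch (/openmp in MSVC, -fopenmp in gcc), the pragma is simply ignored and the same code runs serially. A minimal sketch:

```cpp
// Sum 1..n in parallel. No OpenMP runtime-library calls are used, so the
// code compiles and gives the same result whether or not OpenMP is enabled.
long sumTo(int n)
{
    long sum = 0;
#pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= n; ++i)
        sum += i;
    return sum;
}
```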
Alexander_Agathos
Quoting - ninhngt

OpenMP is not a library; almost all modern compilers support it. Just having to turn on a compiler switch is not too hard. It saves you a lot of manual thread coding and gives better performance.

http://msdn.microsoft.com/en-us/library/tt15eb9t%28VS.80%29.aspx

It is all in vcomp.lib.

Happy New Year,
Alexander.
TimP
Honored Contributor III
Quoting - agalex

It is all in : vcomp.lib


As I take the meaning of the earlier comment, it's not practical to implement OpenMP in a library alone. The functions implemented in vcomp.lib don't have a publicly documented correspondence with the OpenMP standard; we depend on VC to implement the translations. Intel's libiomp5 includes a larger group of functions, including those in vcomp and those called by ICL and ifort /Qopenmp. Both vcomp and libiomp5 (the Windows version) rely on Windows thread-library calls, so the OpenMP library is only a layer between the compiler and the native thread library. The Linux version of libiomp5 implements many of the same functions, as well as the libgomp ones, on top of pthreads.
Alexander_Agathos
Quoting - tim18

Great, then. Of course, OpenMP should be this layer. OpenMP is a very good candidate, and it supports Windows, Linux, and Mac OS, the three platforms I want to officially support in a client-server application I am creating. The server will reside on an Intel-based computer; the client will reside on a smartphone (Symbian OS), using just its OpenGL ES and communication (3G) capabilities (nothing serious in terms of heavy computation there; the processor is just an ARM). The project will finish as soon as Qt decides to provide us with the QtOpenGL library for the Symbian OS. Believe it or not, we can't yet work in 3D using Qt on the Symbian platform.
Alexander_Agathos

You can see my technical note on the parallel processing used in my project and how dramatically it can reduce the computational time, and also, for those in computer graphics segmentation, how you can use existing resources to reduce the complexity of computing the salient points, an extremely time-consuming operation on large 3D meshes if not dealt with properly.

Link: http://users.iit.demokritos.gr/~agalex/TechnicalNote.pdf

Alexander_Agathos
If you would like to experiment with the program and see how it scales across processor families or different types of Core i7 processors, here is the Windows executable: Protrusion Based Segmentation

Note: it can deal with meshes of around 250,000 points. From 160,000 points up it goes into GPU mode for some calculations, so you need to have a CUDA-enabled graphics card.

Also, since it is coded in Visual Studio, you need to have the Microsoft Visual C++ 2008 Redistributable Package installed.