
Mixture of data and task parallelization in TBB

Dear all,

 

I am now using TBB for shared-memory parallelization and am new to the library.

While reading the O'Reilly book by James Reinders, I found that TBB can easily mix task and data parallelism.

Does this mean that we can generally nest parallel algorithms (e.g. parallel_for) regardless of the number of cores?

 

Thank you in advance.

Mitsuru Nishikawa

12 Replies
Employee

Dear Mitsuru,

Thank you for your question.

With TBB, you usually do not need to worry about the number of threads or cores. In addition, you can nest TBB parallel algorithms as you like. I can provide more details if you describe your intended usage model.
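A minimal sketch of such nesting (assuming oneTBB is installed; the matrix size and the doubling operation are arbitrary placeholders):

#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <vector>
#include <cstdio>

int main() {
    const size_t n = 256;
    std::vector<double> m(n * n, 1.0);

    // Outer data-parallel loop over rows...
    tbb::parallel_for(tbb::blocked_range<size_t>(0, n),
        [&](const tbb::blocked_range<size_t>& rows) {
            for (size_t i = rows.begin(); i != rows.end(); ++i) {
                // ...with a nested parallel loop over columns.
                // TBB's work-stealing scheduler balances both levels
                // across the available cores; no explicit thread-count
                // bookkeeping is needed.
                tbb::parallel_for(tbb::blocked_range<size_t>(0, n),
                    [&](const tbb::blocked_range<size_t>& cols) {
                        for (size_t j = cols.begin(); j != cols.end(); ++j)
                            m[i * n + j] *= 2.0;
                    });
            }
        });

    std::printf("m[0] = %g\n", m[0]); // 2.0 after one doubling
    return 0;
}
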

Regards,

Alex


Dear Alex,

Thank you for your reply.

I am now reading about the work-stealing mechanism and am amazed by how it is realized.

It is an excellent library. I appreciate your kind answer.

 

Kind regards

Mitsuru Nishikawa

 

 

Employee

Dear Mitsuru,

You are welcome. If you have any additional questions, do not hesitate to ask.

Regards,

Alex


Dear Alex,

If you do not mind, may I ask an additional question?

I recently learned that Intel TBB can deal with heterogeneous parallel computing (CPU, GPU, etc.).

Does the same property still hold in that situation?

 

Kind regards,

Mitsuru Nishikawa

Employee

Dear Mitsuru,

Certainly, I will answer your questions.

TBB does not use the GPU or other devices directly for computations. However, it can be used to organize heterogeneous computations and synchronize them with CPU computations. For example, the TBB flow graph functionality can synchronize CPU workloads with external devices (e.g. a GPU) with the help of tbb::flow::async_node. In addition, you may want to consider tbb::flow::opencl_node, which allows executing OpenCL kernels as part of a flow graph execution.
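A minimal sketch of the async_node pattern (assuming oneTBB): here a plain std::thread stands in for the external device, and the squaring operation and the input value 7 are arbitrary placeholders.

#include <tbb/flow_graph.h>
#include <atomic>
#include <thread>
#include <cstdio>

int main() {
    using async_node_t = tbb::flow::async_node<int, int>;
    std::atomic<int> result{0};

    tbb::flow::graph g;

    // The async_node hands each input to off-graph work (here a thread
    // simulating a device) and feeds the result back via its gateway.
    async_node_t device(g, tbb::flow::unlimited,
        [](int x, async_node_t::gateway_type& gw) {
            gw.reserve_wait();            // keep the graph alive
            std::thread([x, &gw] {        // "device" work outside the graph
                gw.try_put(x * x);        // return the result to the graph
                gw.release_wait();
            }).detach();
        });

    tbb::flow::function_node<int> sink(g, tbb::flow::serial,
        [&result](int y) { result = y; });

    tbb::flow::make_edge(device, sink);
    device.try_put(7);
    g.wait_for_all();                     // waits for the async work too
    std::printf("device returned %d\n", result.load());
    return 0;
}
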

Regards,

Alex


Dear Alex,

 

Thank you for your kind replies; I deeply appreciate them.

I now conceptually understand the heterogeneous-computing case.

Currently, I have tried mixing data and task parallelism by nesting parallel_for.

However, I only achieve about a 30% CPU usage rate according to Task Manager.

(I heard that TBB uses a greedy scheduler, so I expected it to achieve nearly 100%.)

Is there a technique to achieve higher scalability?

Employee

Dear Mitsuru,

Unfortunately, parallel computing is an engineering discipline that is difficult to theorize about. While there are multiple works on parallel patterns and techniques (e.g. by James Reinders, parallelbook.com), the resulting performance depends on many aspects of the particular system and application (CPU type, workload type, and so on). If you would like to investigate your particular case, could you please share an algorithm description (or a code snippet) and the platform type (e.g. the number of cores/threads)?

For example, 100% usage might not be achieved if there is not enough computational work and TBB worker threads are sent to sleep.

Regards,
Alex


Dear Alex,

 

Thank you for your rapid reply.

I understand that parallel-computing performance depends on several factors.

 

I am using an Intel Core i5 (2 cores and 4 logical processors).

Unfortunately, I cannot show the actual code, but here is a pseudo-code example; I would deeply appreciate any comments on it.

 

============================================================================

using namespace std;

vector<ClsA> As; // in reality, this is a member variable

size_t num = As.size();

// data parallelization

tbb::parallel_for(tbb::blocked_range<size_t>(0, num),
    [&](const tbb::blocked_range<size_t>& r) {
        for (size_t i = r.begin(); i != r.end(); ++i) {
            As[i].memFuncA(); // some procedures
        }
    }
);

Each element of As holds some std::vector data (coordinates, properties, etc.), and memFuncA performs several element-wise operations on them.

I first implemented the member functions sequentially and observed about 30% CPU usage (Windows Task Manager). Then I implemented them using TBB, but the performance did not change.

The size of As is much larger than the number of CPU cores, so at first I expected plain data parallelization to be sufficient.

 

I would appreciate any advice you could give.

Employee

Dear Mitsuru,

May I ask you to provide additional information so I can better understand your use case?

How long does the application run? Have you tried measuring the execution time of a particular loop invocation with and without parallelization (e.g. with std::chrono or tbb::tick_count)? I ask because Windows Task Manager is not an accurate way to measure performance. Usually, it is better to take time points before and after the code block to understand the performance.

Do you use any other TBB functionality or other parallel libraries (e.g. OpenMP)?
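A minimal timing sketch along those lines (assuming oneTBB; the vector size and the sqrt workload are placeholders, not the original member function):

#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <tbb/tick_count.h>
#include <vector>
#include <cmath>
#include <cstdio>

int main() {
    const size_t n = 1 << 20;
    std::vector<double> v(n, 1.0);

    // Bracket the loop with tbb::tick_count (std::chrono works equally
    // well) instead of watching Task Manager.
    tbb::tick_count t0 = tbb::tick_count::now();
    tbb::parallel_for(tbb::blocked_range<size_t>(0, n),
        [&](const tbb::blocked_range<size_t>& r) {
            for (size_t i = r.begin(); i != r.end(); ++i)
                v[i] = std::sqrt(v[i] + double(i));
        });
    tbb::tick_count t1 = tbb::tick_count::now();

    double ms = (t1 - t0).seconds() * 1e3;
    std::printf("parallel loop took %.3f ms\n", ms);
    return 0;
}

Comparing this number against the same loop written sequentially gives the actual speedup.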

Regards,
Alex
 


Dear Alex,

 

> Can I ask to provide additional information to better understand your use case? 

Though I cannot tell you the details, my use case is a physical simulation similar to an N-body simulation.

<https://en.wikipedia.org/wiki/N-body_simulation>

Each particle behaves independently, so their motions (solutions of ODEs) seem suitable for parallel computation.

(As in the example in my earlier comment, memFuncA solves an ODE with some procedures, but the tasks are independent; they do not touch other particles' data.)

Thus, the computation is a time iteration of the same procedure.
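A hypothetical sketch of this pattern (assuming oneTBB; the Particle struct, the time step dt, and the explicit-Euler update are placeholders, not the original code): a sequential outer time loop, with each step updating all particles in parallel because the updates are independent.

#include <tbb/parallel_for.h>
#include <vector>
#include <cstdio>

struct Particle { double x = 0.0, v = 1.0; };

int main() {
    std::vector<Particle> particles(10000);
    const double dt = 0.01;

    for (int step = 0; step < 100; ++step) {           // sequential in time
        tbb::parallel_for(std::size_t(0), particles.size(),
            [&](std::size_t i) {                       // independent per particle
                particles[i].x += particles[i].v * dt; // e.g. explicit Euler
            });
    }
    std::printf("x[0] = %.2f\n", particles[0].x);      // 100 * 0.01 = 1.00
    return 0;
}
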

Though I have not measured each block of the subroutines, I measured the CPU time per hundred iterations using std::chrono.

On my CPU (Core i5, 2 cores and 4 logical processors), approximately a 2x speedup is confirmed, but I suppose I might be able to get more performance.

Should I measure the performance with tools suited to HPC parallel computing (e.g. Intel VTune Amplifier)?

I currently use only parallel_for and do not use any other parallel libraries concurrently.

 

Kind regards

Mitsuru Nishikawa

 

Employee

Dear Mitsuru,

It looks like my reply last week was lost for an unclear reason. I am really sorry about that.

Thank you a lot for the details. As far as I know, the N-body example is a compute-intensive application that might not benefit from Hyper-Threading. Consider the blog article about the efficiency of Hyper-Threading on different applications (scroll down to the "thread-intensive workload" example).

Your CPU has only two cores, so a 2x speedup seems a good result.

Certainly, I would recommend using Intel VTune Amplifier to investigate whether something can be improved.

Feel free to share your opinion and ask questions if you have any.

 

> Though I have not measured each block of the subroutines, I measured the CPU time per hundred iterations using std::chrono.

That is a good approach because it averages out execution-time deviations and initialization (warm-up) time.

 

Regards,
Alex


Dear Alex,

 

Thank you for your replies; I do not mind a slightly delayed reply.

I understand that the performance I reported is reasonable, and I realize that the benefit from Hyper-Threading depends on the case.

I deeply appreciate your helpful comments over the course of this thread. Thank you very much.

 

Best regards,

Mitsuru
