
diedler_f_ · ‎09-19-2017

Hi everyone,

I wonder whether there is a bug in the TBB library. Let me explain:

I have a long computation over 1,000,000 objects. For this, I use the parallel_for pattern like this:

size_t size = 1000000;
tbb::parallel_for(tbb::blocked_range<size_t>(0, size),
        [&](const tbb::blocked_range<size_t>& r) {
            for (size_t i=r.begin(); i != r.end(); ++i)
            {
                //long computation here
            }
        });

When I launch the program, sometimes my 8 cores work at 100% => That's great !

But sometimes my threads lose performance (I will provide a chart of my CPU usage, which will make this clearer).

Can you explain why ? Thanks :)

PS: I use Windows 10 with an Intel i7 (8 cores)

diedler_f_ · ‎09-19-2017

[Attached image: CPU usage chart]

Alexei_K_Intel · ‎09-19-2017

Hi,

What sort of "long computation" is performed? Is it pure math or does the algorithm perform some IO operations (read files, sockets and so on)? Do you use some synchronizations, third-party library calls and/or OS API inside the computations?

Regards,
Alex

diedler_f_ · ‎09-19-2017

Hi,

It is pure math: no IO or sockets, no API calls, and no synchronisation. It is strange because sometimes all the cores work and sometimes they do not...

Any ideas ?

EDIT: I tried changing the partitioner but got the same result. Maybe I made some mistakes, because I am not an expert in this.

Alexei_K_Intel · ‎09-21-2017

Could you provide a complete reproducer and some related details about your environment and use case, please?

Regards,
Alex

diedler_f_ · ‎09-21-2017

Alex (Intel) wrote:

Could you provide a complete reproducer and some related details about your environment and use case, please?

Yes, I wrote a small example that reproduces the problem:

   size_t nbCombiFast = 100000;
    tbb::parallel_for(tbb::blocked_range<size_t>(0, nbCombiFast),
        [&](const tbb::blocked_range<size_t>& r) {
            for (size_t i=r.begin(); i != r.end(); ++i)
            {
                std::vector<float> v;
                float r = 0;
                for (size_t o=0; o<10000000; ++o)
                {
                    r += o*sin(o) - cos(o);
                }
                v.push_back(r); // to avoid compiler optimizations...
            }
        }
    );

My environment is :

MinGW v5.1 (with GCC 5.1 (tdm-1) with Thread model = posix)
Compiler options: Release mode with -O3 for optimizations
Windows 10 with Intel i7-6820HQ @ 2.70 GHz
I don't remember the TBB version, but the folder I compiled from was named "tbb2017_20161128oss". There are no binaries for the MinGW compiler, so I compiled the TBB library for MinGW myself (maybe the problem is here?)

When I launch this example, sometimes all my cores work at 100% and sometimes it takes a long time to reach 100%...

Thanks

jimdempseyatthecove · ‎09-25-2017

size_t nbCombiFast = 100000;
atomic<int64_t> hack = 0; // add to eliminate heap critical section
 tbb::parallel_for(tbb::blocked_range<size_t>(0, nbCombiFast),
     [&](const tbb::blocked_range<size_t>& r) {
         for (size_t i=r.begin(); i != r.end(); ++i)
         {
// remove    std::vector<float> v;
             float r = 0;
             for (size_t o=0; o<10000000; ++o)
             {
                 r += o*sin(o) - cos(o);
             }
// remove    v.push_back(r); // to avoid compiler optimizations...
             hack += (int64_t)r; // non-critical section
         }
     }
 );
if (hack) std::cout << "Won't print"; // avoid meaningless code elimination

If the above eliminates the symptom, then this indicates adverse interaction with heap.

Note, if you have the TBB scalable allocator activated, your former code may have experienced an excessive number of "first touch" delays as slabs are allocated, touched for initialization, and touched again as each vector grows (iow when the allocated node changes size).

To confirm this, place your original code into a function, then call this function twice, with a 10 second sleep function between calls. I expect that your first call will exhibit the existing chart, and the second call will produce the expected chart.

Jim Dempsey

diedler_f_ · ‎09-25-2017

Same problem with your atomic hack variable instead of my std::vector... I am posting the code just to be sure there are no errors.

void func()
{
    tbb::atomic<int64_t> hack = 0;
    tbb::parallel_for(tbb::blocked_range<size_t>(0, 100),
    [&](const tbb::blocked_range<size_t>& r) {
        for (size_t i=r.begin(); i != r.end(); ++i)
        {
            float r = 0;
            for (size_t o=0; o<10000000; ++o)
            {
                r += o*sin(o) - cos(o);
            }
            hack += (int64_t)r;
        }
    });
    if (hack) std::cout << "Youpi !";
}

int main()
{
    //tbb::task_scheduler_init init(8);
    auto d = std::chrono::system_clock::now();
    func();
    auto e = std::chrono::system_clock::now();
    auto millis = std::chrono::duration_cast<std::chrono::milliseconds>(e - d).count();
    std::cout << "Milli = " << millis << std::endl;
    Sleep(10000);

    auto d2 = std::chrono::system_clock::now();
    func();
    auto e2 = std::chrono::system_clock::now();
    auto millis2 = std::chrono::duration_cast<std::chrono::milliseconds>(e2 - d2).count();
    std::cout << "Milli = " << millis2 << std::endl;

    return 0;
}

I also ran your second test, but it did not help. Sometimes the first call is faster than the second, and sometimes the second is faster than the first. How can I tell if I am using the TBB scalable allocator?

I attach a picture of my CPU usage when launching the program. At the start, the CPU is at full speed (100%), then for some reason the usage drops... and then increases again. It does not have time to reach 100% again because the computation finishes first.

I really don't understand the problem. Have you got the same issue with your CPU ?

Thanks !

Alexei_K_Intel · ‎09-26-2017

Is the picture shown for the provided example? Where is the "Sleep(10000)" time (low CPU utilization)? It should be about 2-2.5 rectangles on the X-axis. What is the typical running "Milli = " times for the first and the second runs in the example?

Regards,
Alex

diedler_f_ · ‎09-26-2017

Hi,

No, the graph shows only the first call, because there was no room for the full benchmark in the graph...
Typically, func() takes between 21 000ms (best case, when the CPU is always at full speed) and 37 000ms (worst case). The average is about 24 000ms.

I attach another graph with the full benchmark (1st call and 2nd call), with 10s between the calls. The first call takes 23363ms and the second 36261ms. A big difference in this case.

If I launch again the bench 3 times I have:

1st call = 22512ms / 2nd call = 32042ms
1st call = 37233ms / 2nd call = 36134ms (no luck for this bench !)
1st call = 30162ms / 2nd call = 35332ms

Thanks

jimdempseyatthecove · ‎09-26-2017

Alex,

After looking at the chart in #10 I will make an educated guess at what might cause the symptom.

TBB, like virtually every well-written multi-threading system with a thread pool, tries to be nice to other processes on the system. To do this, each thread suspends itself when it has been unable to find work for some period of time. The suspension is typically a timed wait on a condition variable (pthread or std::condition_variable) or, on Windows, on an event via WaitForSingleObject. The symptom for the second call is indicative of the TBB thread management code .NOT. signaling the condition variable or event for the other thread(s) when work becomes available. IOW the additional threads do not run immediately, but start up only after the timer expires.

Note, this omission includes the situation whereby the main thread properly notifies one of the waiting threads, but that thread fails to properly notify the correct other threads. Potentially this could be the result of the second thread notifying itself (or other running thread) as opposed to notifying a waiting thread. IOW do not simply look at what the main thread does, but look deeper at what the woken-up threads do.

Jim Dempsey
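
The hypothesis above can be illustrated outside TBB. The following is a minimal sketch using only the C++ standard library; it is not TBB's implementation, and wake_latency and idle_timeout are hypothetical names chosen for illustration. A worker sleeps in a timed wait, and the producer either signals the condition variable or "forgets" to, in which case the worker only notices the work when its timer expires.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

using namespace std::chrono;

// Returns how long a sleeping worker takes to notice newly posted work.
// `notify` selects whether the producer signals the condition variable.
// `idle_timeout` is the worker's timed-wait period (hypothetical value).
milliseconds wake_latency(bool notify, milliseconds idle_timeout) {
    assert(idle_timeout.count() > 0);
    std::mutex m;
    std::condition_variable cv;
    bool work_ready = false;
    steady_clock::time_point posted, noticed;

    std::thread worker([&] {
        std::unique_lock<std::mutex> lock(m);
        // Timed wait: the flag is re-checked after every wake-up or timeout.
        while (!work_ready)
            cv.wait_for(lock, idle_timeout);
        noticed = steady_clock::now();
    });

    // Give the worker time to go to sleep before posting work.
    std::this_thread::sleep_for(idle_timeout / 4);
    {
        std::lock_guard<std::mutex> lock(m);
        work_ready = true;
        posted = steady_clock::now();
    }
    if (notify) cv.notify_one();  // the "missed signal" case skips this

    worker.join();
    return duration_cast<milliseconds>(noticed - posted);
}
```

Calling wake_latency(true, milliseconds(400)) should return almost immediately, while wake_latency(false, milliseconds(400)) should take roughly the remainder of the timeout. That delayed start is the kind of slow ramp-up to 100% described in this thread, if the guess about a missed notification is right.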

diedler_f_ · ‎09-26-2017

Is there any solution to avoid that ? It is really bad for my application that needs full performance.

I do not want to rewrite an entire thread pool for my application...

jimdempseyatthecove · ‎09-26-2017

This is something that needs to be fixed inside TBB. About all you can do now is keep your threads busy. Until a fix comes in, a hack (a crude one) is to schedule (number of threads in the pool - 1) low-priority tasks, each of which sleeps 1ms (or less) and checks a program termination flag (which you set at the end). If termination is not indicated, the task re-schedules itself on the low-priority queue and then exits.

Presumably, when your program is running "hot", these tasks will not get dequeued. Only when a thread enters stealing mode with nothing else to do will it take one of these tasks. You will need to post a reminder to remove this when the TBB library gets fixed.

Jim Dempsey

diedler_f_ · ‎09-26-2017

jimdempseyatthecove wrote:

This is something that needs to be fixed inside TBB. About all you can do now is keep your threads busy. Until a fix comes in, a hack (a crude one) is to schedule (number of threads in the pool - 1) low-priority tasks, each of which sleeps 1ms (or less) and checks a program termination flag (which you set at the end). If termination is not indicated, the task re-schedules itself on the low-priority queue and then exits.

I am sorry but I don't understand your solution. Can you provide a short piece of code ?

jimdempseyatthecove wrote:

You will need to post a reminder to remove this when the TBB library gets fixed.

I am looking forward to seeing this fix :)

Thanks

jimdempseyatthecove · ‎09-28-2017

Note, untested code, crude hack, you are welcome to improve on this


#if defined(USE_LowPrioritySpinTask)
class MyLowPrioritySpinTask : public tbb::task {
    static bool terminateFlag = false;
    static atomic<int> terminated = 0;
    /*override*/ tbb::task* execute() {
        if(terminateFlag)
        {
           ++terminated;
           return NULL;
        }
        tbb::task::enqueue(this, tbb::priority_t::low); // re-queue our task at low priority
        return NULL
    }
  public:
    void terminate() { terminateFlag = true; }
};
#endif

...
#if defined(USE_LowPrioritySpinTask)
int nSpinnerTasks = YourTBBworkerPoolSizeYouDetermineThis() - 1;
vector<MyLowPrioritySpinTask *> vMyLowPrioritySpinTasks;
for(int i=0; i<nSpinnerTasks; ++i)
{
    MyLowPrioritySpinTask * t = new (tbb::task::allocate_root()) MyLowPrioritySpinTask ();
    tbb::task::enqueue(*t, tbb::priority_t::low);
    vMyLowPrioritySpinTasks.push_back(t);
}
#endif

doYourProgramHere();

#if defined(USE_LowPrioritySpinTask)
MyLowPrioritySpinTasks::terminate();
while(MyLowPrioritySpinTasks::terminated < vMyLowPrioritySpinTasks.size())
  mm_pause();
for(int i=0; i<vMyLowPrioritySpinTasks.size(); ++i)
    delete vMyLowPrioritySpinTasks[i];
#endif

*** Caution, the above, as coded, will make your program 100% active all the time. It is up to you to expand upon this to meet your needs. As an example, you could place a timed sleep function into the top of the execute(). This will introduce ~ 1/2 this sleep time latency in getting your threads going again. The sleep could be conditioned upon how long it took between entries. i.e virtually no time == no other work needs to be done so perform sleep (remember to recapture time/ticks following sleep).

Jim Dempsey
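
The conditional sleep suggested in the caution above could be sketched as follows. This is a standalone illustration, not TBB code: SpinThrottle, idle_gap and nap are hypothetical names, and the thresholds are assumptions. The idea is that a spinner which is re-entered almost immediately concludes that no other work was queued and takes a nap, and that the recorded entry time is moved past the nap so the nap is not mistaken for busy time.

```cpp
#include <cassert>
#include <chrono>

using Clock = std::chrono::steady_clock;
using std::chrono::microseconds;

// Decides whether a re-queued spinner task should sleep on this entry.
// All names and thresholds here are hypothetical.
class SpinThrottle {
public:
    SpinThrottle(microseconds idle_gap, microseconds nap)
        : idle_gap_(idle_gap), nap_(nap) {
        assert(idle_gap.count() >= 0 && nap.count() >= 0);
    }

    // Call at the top of execute() with the current time.
    // Returns the nap length to sleep for (zero means "do not sleep").
    microseconds on_entry(Clock::time_point now) {
        microseconds result(0);
        if (has_last_ && now - last_ < idle_gap_) {
            // Re-entered almost immediately: no other work was queued.
            result = nap_;
        }
        // Record the time as it will be after the nap, so the nap itself
        // is not counted as "time between entries" on the next call.
        last_ = now + result;
        has_last_ = true;
        return result;
    }

private:
    microseconds idle_gap_;
    microseconds nap_;
    Clock::time_point last_{};
    bool has_last_ = false;
};
```

Inside a spinner this would be used roughly as: auto nap = throttle.on_entry(Clock::now()); if (nap.count() > 0) std::this_thread::sleep_for(nap);. As noted above, a longer nap lowers idle CPU usage but adds on average about half the nap time of latency before the thread picks up real work.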

diedler_f_ · ‎09-29-2017

I will try your solution later. Just a few questions :

1) mm_pause() does not seem to exist in TBB. Do you mean the _mm_pause() function?

2) For doYourProgramHere(), I assume I should put this?

 tbb::parallel_for(tbb::blocked_range<size_t>(0, nbCombiFast),
     [&](const tbb::blocked_range<size_t>& r) {
         for (size_t i=r.begin(); i != r.end(); ++i)
         {
             std::vector<float> v;
             float r = 0;
             for (size_t o=0; o<10000000; ++o)
             {
                 r += o*sin(o) - cos(o);
             }
             v.push_back(r); // to avoid compiler optimizations...
         }
     }
 );

3) And does YourTBBworkerPoolSizeYouDetermineThis() mean nbCombiFast (i.e. 100 000 elements) in my example?

4) Wouldn't it be faster if I re-enqueued all my tasks with high priority instead of low?

5) You said "Caution, the above, as coded, will make your program 100% active all the time."

At first sight, after:

while(MyLowPrioritySpinTasks::terminated < vMyLowPrioritySpinTasks.size())
	mm_pause();

my CPU should no longer be at 100%, right? All the tasks will have finished. I need CPU usage to drop back down after the big computation in my program.

Thank you very much

jimdempseyatthecove · ‎09-29-2017

1) _mm_pause. As I said, untested code (may contain typos, missing code, etc.)
2) yes. main() { init TBB, launch spinners, your code here, terminate spinners }
3) No, this means the number of hardware threads you establish for the TBB thread pool. Your i7-6820HQ has 4 cores and 8 threads, and you would typically use an 8-hardware-thread TBB thread pool. *** However, there are circumstances where you may want to use fewer or more hardware threads. There may be a TBB function to return the number of hardware threads used by the TBB thread pool.
4) No.
5) Yes. _mm_pause() is an instruction that relieves instruction cycles (and power, L1 ICache activity) but does not suspend the thread from execution. The thread will be in the run state until it exits the loop (or the O/S preempts the software thread).

The 100% is for the duration of the "launch spinners" thru "terminate spinners". Please observe that while I showed launching at start of program and terminating at end of program, you can improve upon this by launching and terminating around specific sections of your code that exhibit the unnecessary startup delays.

Jim Dempsey
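
Regarding point 3, the pool size can be derived from the hardware rather than hard-coded. The sketch below uses only the C++ standard library so that it stands alone; hardware_thread_count and spinner_task_count are hypothetical helper names. In the TBB versions of that era, tbb::task_scheduler_init::default_num_threads() reports the scheduler's default thread count and would be the more direct choice; whether it matches the pool actually in use depends on how the scheduler was initialized.

```cpp
#include <cassert>
#include <thread>

// Number of hardware threads, with a fallback because the standard allows
// hardware_concurrency() to return 0 when the value cannot be determined.
unsigned hardware_thread_count() {
    unsigned n = std::thread::hardware_concurrency();
    unsigned result = (n == 0) ? 1u : n;
    assert(result >= 1);
    return result;
}

// Spinner tasks to launch for a pool of `pool_size` threads: one fewer
// than the pool, leaving the main thread free, and never negative.
unsigned spinner_task_count(unsigned pool_size) {
    return pool_size > 1 ? pool_size - 1 : 0;
}
```

On the i7-6820HQ discussed here (4 cores, 8 hardware threads), spinner_task_count(8) gives 7, matching the "pool size - 1" in the code above.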

diedler_f_ · ‎09-29-2017

I think I made a mistake while implementing your solution, because it crashes:

In DEBUG mode: the application crashes within TBB with the error "pure virtual method called. Terminate called without an active exception."
In RELEASE mode: the application crashes after the "END" and before the "return 0;"

The whole code:

#include <iostream>
#include <tbb/tbb.h>
#include <chrono>

#define USE_LowPrioritySpinTask

#if defined(USE_LowPrioritySpinTask)
class MyLowPrioritySpinTask : public tbb::task {
    static bool terminateFlag;
    /*override*/ tbb::task* execute() {
        if(terminateFlag)
        {
           ++terminated;
           return NULL;
        }
        tbb::task::enqueue(*this, tbb::priority_t::priority_low); // re-queue our task at low priority
        return NULL;
    }
  public:
    static tbb::atomic<int> terminated;
    static void terminate() { terminateFlag = true; }
};

bool MyLowPrioritySpinTask::terminateFlag = false;
tbb::atomic<int> MyLowPrioritySpinTask::terminated(0);

#endif

void func()
{
    tbb::atomic<int64_t> hack = 0;
    tbb::parallel_for(tbb::blocked_range<size_t>(0, 100),
    [&](const tbb::blocked_range<size_t>& r) {
        for (size_t i=r.begin(); i != r.end(); ++i)
        {
            float r = 0;
            for (size_t o=0; o<10000000; ++o)
            {
                r += o*sin(o) - cos(o);
            }
            hack += (int64_t)r;
        }
    });
    if (hack) std::cout << "Youpi !";
}

int main()
{
    unsigned int nbThread = 8;

    tbb::task_scheduler_init init(nbThread);

#if defined(USE_LowPrioritySpinTask)
    int nSpinnerTasks = nbThread - 1;
    std::vector<MyLowPrioritySpinTask *> vMyLowPrioritySpinTasks;
    for(int i=0; i<nSpinnerTasks; ++i)
    {
        MyLowPrioritySpinTask * t = new (tbb::task::allocate_root()) MyLowPrioritySpinTask ();
        tbb::task::enqueue(*t, tbb::priority_t::priority_low);
        vMyLowPrioritySpinTasks.push_back(t);
    }
#endif

    // Big Computation here
    auto d = std::chrono::system_clock::now();
    func();
    auto e = std::chrono::system_clock::now();
    auto millis = std::chrono::duration_cast<std::chrono::milliseconds>(e - d).count();
    std::cout << "Milli = " << millis << std::endl;

#if defined(USE_LowPrioritySpinTask)
    MyLowPrioritySpinTask::terminate();
    while(MyLowPrioritySpinTask::terminated < vMyLowPrioritySpinTasks.size())
        _mm_pause();

    std::cout << "DELETING spinners..." << std::endl;
    for(int i=0; i<vMyLowPrioritySpinTasks.size(); ++i)
        delete vMyLowPrioritySpinTasks[i];
#endif

    std::cout << "END" << std::endl;

    return 0;
}

Maybe something is wrong ?

Thanks

jimdempseyatthecove · ‎09-30-2017

I tried the code and made several variations, each resulting in an assert. Then I tried using recycle_as_safe_continuation, which removed the assert; however, when the task recycled, it went onto the normal priority queue (as opposed to keeping the priority it had). This is much harder than it first appears.

I haven't experimented with task_arena, but I suspect this will have the same symptom (the continued task having higher than low priority).

Does someone else on this forum have any suggestions? (other than wait for a fix)

Jim Dempsey

diedler_f_ · ‎10-02-2017

I will wait for a fix, please let me know where this fix will be available.

Maybe if I have enough time, I will write a custom thread pool for my application but I really don't want to do this...

Does anyone in this forum have a better idea ?

Thanks, especially to Jim Dempsey, for helping me

Alexei_K_Intel · ‎10-02-2017

jimdempseyatthecove wrote:

After looking at the chart in #10 I will make an educated guess at what might cause the symptom.

TBB, like virtually every well-written multi-threading system with a thread pool, tries to be nice to other processes on the system. To do this, each thread suspends itself when it has been unable to find work for some period of time. The suspension is typically a timed wait on a condition variable (pthread or std::condition_variable) or, on Windows, on an event via WaitForSingleObject. The symptom for the second call is indicative of the TBB thread management code .NOT. signaling the condition variable or event for the other thread(s) when work becomes available. IOW the additional threads do not run immediately, but start up only after the timer expires.

Note, this omission includes the situation whereby the main thread properly notifies one of the waiting threads, but that thread fails to properly notify the correct other threads. Potentially this could be the result of the second thread notifying itself (or other running thread) as opposed to notifying a waiting thread. IOW do not simply look at what the main thread does, but look deeper at what the woken-up threads do.

The TBB runtime keeps a list of sleeping threads and notifies only the threads that are in this list. The logic is OS-agnostic and pretty simple (get from the list: private_server.cpp#L384-L385, notify the threads: private_server.cpp#L393-L394). Therefore, if the guess is correct, the discussed issue should be reproducible on any OS and any machine. I failed to reproduce the issue on several Windows-based machines. Even if the problem is on the TBB side, it is not that simple and is caused by a specific environment (we certainly have quite extensive coverage in our testing).

Regards,
Alex

CPU not used at 100%