Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

pragma omp task priority(n) - when?

jimdempseyatthecove
Honored Contributor III

I would like to specify task priority in OpenMP. This is not available in V17.0.1 (nor mentioned in release notes for V17.0.4).

My intended use is for an MPI-with-OpenMP application where, on rank 0, a spawned task (or the master thread prior to the first task) runs at elevated task priority. This task manages a work queue for tasks issued to rank 0, as well as issuing task requests to the other ranks via MPI messaging.

What I want to accomplish is to have the task-manager task .NOT. participate in the tasks that it enqueues. Should it do so (which I expect it is doing now), it will introduce an undesired latency in servicing the tasks to be issued to the additional MPI ranks (as well as to itself).

Ideally it would be nice to have

   #pragma omp task deferred

where the task is enqueued but not run by the enqueuing task, except when the enqueuing task issues a taskwait.

This feature would not require implementation of task priority.

BTW, my code restricts the number of pending tasks, so it would not accumulate too many pending deferred tasks.
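For reference, OpenMP 4.5 does define a priority clause, though it is only a scheduling hint and does not guarantee deferral. A minimal sketch of how the manager/worker split might look on a runtime that honors it (assuming OMP_MAX_TASK_PRIORITY is set to 1 or higher; the loop bound and the printf are placeholders, not my actual code):

#include <omp.h>
#include <cstdio>

int main()
{
  #pragma omp parallel
  #pragma omp master
  {
    #pragma omp task priority(1) // manager task: preferred by the scheduler
    {
      for (int i = 0; i < 100; ++i)
      {
        #pragma omp task priority(0) firstprivate(i) // worker task: lowest priority
        {
          std::printf("work item %d\n", i);
        }
      }
      #pragma omp taskwait // manager is free to help with worker tasks only here
    }
    #pragma omp taskwait
  }
  return 0;
}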

Jim Dempsey

SergeyKostrov
Valued Contributor II
>>...I would like to specify task priority in OpenMP. This is not available in V17.0.1 (nor mentioned in release notes for V17.0.4).

Here are a couple of notes:

- Task priorities were introduced in OpenMP version 4.5 ( Nov 2015 ), but I'm not sure if Intel C++ compiler v17 update 4 fully supports them.

- I've been talking about priorities for OpenMP threads since 2014, because I was forced to implement my own solution for that. Since the OpenMP thread ID is always known, it is possible to get the OS thread handle ( on Windows, Linux, etc. ), and after that the priority of that OpenMP thread can be changed, boosted or lowered.

- Would it be better to request a new KMP_THREADPRIORITY environment variable ( or KMP_TASKPRIORITY ) as an extension of the Intel OpenMP runtime library? It could be similar to the KMP_AFFINITY environment variable. Something like:

KMP_AFFINITY=granularity=fine,proclist=[0,1,16,17,48,49,32,33],explicit
KMP_THREADPRIORITY=prioritylist=[0,4,8,12,16,20,24,28],explicit

where the numbers 0, 4, 8, 12, 16, 20, 24 and 28 correspond to different priorities from Idle to Real-Time. Or, it could be done as an additional attribute of KMP_AFFINITY.
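A minimal sketch of that manual approach (Windows API shown; the helper function name is illustrative, and on Linux pthread_setschedparam would be the analogue). Note that this changes OS thread priority, not OpenMP task priority:

#include <omp.h>
#ifdef _WIN32
#include <windows.h>
#endif

// Sketch only: raise the OS priority of OpenMP thread 0 inside a parallel region.
void boost_openmp_thread0()
{
  #pragma omp parallel
  {
    if (omp_get_thread_num() == 0)
    {
#ifdef _WIN32
      SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_ABOVE_NORMAL);
#endif
      // On Linux, pthread_self() plus pthread_setschedparam() would go here.
    }
  }
}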
jimdempseyatthecove
Honored Contributor III

Sergey,

Thanks for the response... but you misunderstood the question.

The priority I am talking about is not the thread priority; rather, it is the task priority. These are not the same. Let me explain further...

#pragma omp task
{
   doWork();
}

In the above, doWork() can be run as a deferred task by some other thread .OR. as a direct (undeferred) task by the enqueuing thread (at the discretion of the implementation). What I want to ensure is that the enqueuing thread does .NOT. execute the enqueued task (i.e., force the task to be deferred).

Consider:

#pragma omp parallel
{
  #pragma omp master
  {
    #pragma omp task priority(1)
    {
      for(;!Done;)
      {
        ... // some code
        #pragma omp task // priority(0)
        {
          doWork(); // not performed by priority(1) at enqueuing time
        }
        ... // other code
      } // for(;!Done;)
      #pragma omp taskwait // priority(1) permitted to doWork() here
    } // omp task priority(1)
    #pragma omp taskwait
  } // omp master
} // omp parallel

I hope I entered that correctly.

The goal is for the for(;!Done;) loop to .NOT. take a detour into doWork() during execution of the loop. However, upon Done, the subsequent (innermost) taskwait would permit the priority(1) task to execute any pending doWork() tasks.
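Side note: on a runtime that implements OpenMP 4.5, the priority clause only takes effect when the maximum task priority is greater than zero, which is controlled by the OMP_MAX_TASK_PRIORITY environment variable. A quick sketch of how to check that at run time:

#include <omp.h>
#include <cstdio>

int main()
{
  // Returns 0 unless OMP_MAX_TASK_PRIORITY is set to a positive value;
  // with a maximum of 0, any priority() value is clamped and has no effect.
  std::printf("max task priority = %d\n", omp_get_max_task_priority());
  return 0;
}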

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

The reason for the above is that the "some code" and "other code" sections are managing a job queue whose jobs are distributed to the ranks of an MPI application (including itself as rank 0).

Without the task priority, the for(;!Done;) loop can detour into doWork(), and thus induce a response latency for the MPI messaging (and for the rank 0 task processing in for(;!Done;)); the latency time == the runtime of the specific doWork(). I do not want this intermittent latency (it can be on the order of 60 seconds).

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

Additional information that may be of use to the readers of this thread.

In my work on this problem, I've discovered a compiler optimization issue with respect to OpenMP tasking. IMHO this is a bug. The failing code is as follows (simplified, and assuming <atomic>, <chrono>, <thread> and <omp.h> are included):

#pragma omp parallel
{
  #pragma omp master
  {
    // negative job number indicates no more jobs
    const int jobListDone = -1;
    const int jobNotAvailable = 888888; // some positive number larger than highest possible job number (typically less than 500)
    std::atomic<int> jobForRank0(jobNotAvailable);

    // job queue management task
    #pragma omp task
    {
      for(int jobIndex = 0; jobIndex < jobQueue.size(); ++jobIndex)
      {
        int jobNumber = jobIndex; // simplification of code
        for(;;)
        {
          for(int rank=0; rank < nRanks; ++rank)
          {
            if(availableProcessingResources(rank))
            {
              // found a rank with sufficient resources
              if(rank == 0)
              {
                 // special case for rank 0 (self)
                 if(jobForRank0 == jobNotAvailable)
                 {
                    jobForRank0 = jobNumber;
                    jobNumber = jobNotAvailable; // indicate job dispatched
                    break;
                 }
              }
              else
              {
                // rank > 0
                DispatchJobToMPIRank(rank, jobNumber);
                jobNumber = jobNotAvailable; // indicate job dispatched
                break;
              }
            } // if(availableProcessingResources(rank))
          } // for(int rank=0; rank < nRanks; ++rank)
          if(jobNumber == jobNotAvailable)
            break;
          std::this_thread::sleep_for(std::chrono::milliseconds(100)); // wait a bit
        } // for(;;)
      } // for(int jobIndex = 0; jobIndex < jobQueue.size(); ++jobIndex)
      // empty job list
      // wait for rank 0 job processing task to consume remaining job number (if any)
      for(;jobForRank0 != jobNotAvailable; )
        std::this_thread::sleep_for(std::chrono::milliseconds(100)); // wait a bit
      jobForRank0 = jobListDone; // inform rank 0 processing task to exit
    } // #pragma omp task
    
    // rank 0 processing task
    #pragma omp task
    {
       for( ;jobForRank0 != jobListDone; )
       {
          if(jobForRank0 == jobNotAvailable)
          {
            std::this_thread::sleep_for(std::chrono::milliseconds(100)); // wait a bit
          }
          else
          if(jobForRank0 >= 0)
          {
             int jobNumber = jobForRank0;
             jobForRank0 = jobNotAvailable;
             #pragma omp task firstprivate(jobNumber)
             {
               doWork(jobNumber);
             }
          }
       } // for( ;jobForRank0 != jobListDone; )
       #pragma omp taskwait
    } // omp task
    #pragma omp taskwait
  } // omp master
} // omp parallel

What is happening, and I am of the opinion this is a bug, is that the second task has a loop that appears (to the optimizer) to contain loop-invariant code. From examination in the debugger, the #pragma omp task for the second task captured jobForRank0 as a local copy... even though it is an atomic<int> (it also captured the const jobNotAvailable). The captured values were both 0, though this may be an artifact of registerized variables; in any event, the atomic<int> jobForRank0 should not have been registerized or captured.

The solution was to use

      #pragma omp task shared(jobForRank0)

That variable should have been shared by default. My guess is that the optimizer, not seeing an explicit shared clause (and seeing no change to the variable within the task), took the liberty of capturing the value.
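For what it's worth, the OpenMP default data-sharing rules can make a variable that is not shared in the enclosing context firstprivate in a task, so an explicit shared clause is the safe choice in any case. A stripped-down illustration of the workaround (the names here are illustrative, not the original code):

#include <atomic>
#include <cstdio>
#include <omp.h>

int main()
{
  #pragma omp parallel
  #pragma omp master
  {
    // Declared inside the master block, so the variable is NOT shared in the
    // enclosing context; without shared(flag) the tasks could capture a copy.
    std::atomic<int> flag(0);

    #pragma omp task shared(flag)
    {
      flag.store(1); // producer task writes the shared atomic
    }
    #pragma omp taskwait

    #pragma omp task shared(flag)
    {
      std::printf("flag seen by second task: %d\n", flag.load()); // prints 1
    }
    #pragma omp taskwait
  }
  return 0;
}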

I hope this helps others with similar issues.

Jim Dempsey
