OpenMP untied tasks and taskyield do not work as expected

Hi,

I tried to implement untied OpenMP tasks that yield when they encounter a modeled idle time due to synchronization. The code is available on Stack Overflow: https://stackoverflow.com/questions/47658571/how-to-yield-resume-openmp-untied-tasks-correctly

It turns out that icc 18.0.1 supports yielding for non-master threads, while icc 17.0.4 did not. However, the tasks still seem to be tied, since the thread that starts a task always resumes it after suspension. There is no task migration! Furthermore, I noticed that tied tasks, i.e. when you remove the "untied" clause from the code, do not yield at all, although I think they should.
Actually, I would expect the "tied" code to behave exactly as the given "untied" one does with the present compiler, while the expected "untied" behavior cannot be achieved at all.

Are there any restrictions in the OpenMP runtime? Where are they documented? Or is this a bug?

Best regards

Georg Geiser

9 Replies

A couple of points worth mentioning:

1. This is not a bug, because the OpenMP standard defines the behavior you expect as an optimization, not as a requirement.

2. As an experiment you can try the clang compiler, where untied tasks were "properly" implemented from the beginning, and where untied tasks were recently excluded from the task scheduling constraint (TSC) in the OpenMP runtime library. Without this change a thread usually cannot steal an untied task from another thread because of the TSC, so it looks like no yielding occurs. Unfortunately, this code has not been released yet; it will be part of the coming clang 6.0 release. For now it is possible to get the OpenMP runtime library from LLVM trunk for experiments with an already released clang compiler.

3. The implementation of untied tasks cannot be done in the runtime library alone, because only the compiler can split an untied task into parts (as it is the user's code that needs to be split). AFAIK, the Intel compiler still lacks the needed code generation (I guess because of a lack of demand for this feature), so you are welcome to submit a request for this through the compiler support channel (the more people request the feature, the sooner it should be implemented :).

Regards,
Andrey


Ok, I will give the recent clang compiler a try.

However, I still think that icc should support yielding for tied tasks. If you remove the untied clause, there is no yielding at all. The OpenMP standard does not restrict taskyield to untied tasks. You could easily write code that runs into a deadlock but is fully standard-conforming.
Stealing tasks is a different matter, and I can imagine that it is non-trivial to implement. However, I think it has great potential for heavily imbalanced workloads when taskyield is used to exploit idle times, e.g. while waiting for communicated data. So I will contact compiler support to ask for this feature.

Best regards,

Georg


Geiser, Georg wrote:

However, I still think that icc should support yielding for tied tasks. If you remove the untied clause, there is no yielding at all. The OpenMP standard does not restrict taskyield to untied tasks. You could easily write code that runs into a deadlock but is fully standard-conforming.

Stealing tasks is a different thing ...

Yielding should always work: when a thread encounters a taskyield construct, it interrupts the current task and looks for another task to execute, either self-generated or stolen from another thread. If it finds a task that the TSC allows it to execute, the thread runs it and then returns to the postponed task to continue its execution. Recent Intel compilers (17.0 update 5 or 18.0 update 1) have untied tasks excluded from the TSC check (earlier compilers treat tied and untied tasks equally), so if untied tasks are available, the thread should pick them up before continuing the postponed task.

If a thread fails to find any task to execute, it simply continues the postponed task. This can be observed as "no yielding", but the yielding actually always happens.

Regards,
Andrey


>> However, I think this has great potential for heavily imbalanced workloads when taskyield is used to exploit idle times, e.g. while waiting for communicated data.

You can accomplish similar functionality by oversubscribing threads (by the number of such I/O-stalling tasks).

Jim Dempsey


Andrey Churbanov (Intel) wrote:

The yielding should always work...

If you run the referenced example code with the untied clause removed, using 3 OpenMP threads, there is no yielding. I used icc 18.0.1 to check this. Only the untied version yields, even though there should be plenty of deferred tasks; even the task-generating thread does not switch to a different task. So, is there an error in the way I am creating the tasks, such that no deferred tasks are available?

Best regards,

Georg


jimdempseyatthecove wrote:

You can accomplish similar functionality by oversubscribing threads (by the number of such I/O-stalling tasks).

I think this way I would lose the benefit of the cooperative multitasking that OpenMP tasks offer. Preemptive multitasking by probably unbound, oversubscribed threads would definitely worsen cache performance. I want CPU bursts to run as long as possible on an exclusive hardware thread.

I think the taskyield mechanism is a nice and portable way to have coroutines in C and Fortran, which the languages do not offer per se. Here, support for untied task stealing would help achieve the best load balance. Available libraries (e.g. GNU Portable Threads) do not support manycore architectures, so I cannot use them. There are plenty of other libraries (cf. the Wikipedia article on coroutines), but they are usually non-portable and do not support manycores either.

Best regards,

Georg


>> I want CPU bursts to run as long as possible on an exclusive hardware thread.

Untied tasks are, well, untied, aren't they? IOW they are not placement-savvy with respect to software threads and hardware threads.

It sounds like you have affinitized computational sections of code (possibly using all available hardware threads) interspersed with I/O sections. Furthermore, because you have considered a co-routine program structure, you want to specify at which points in a thread's execution you desire, or are willing, to be handed the responsibility of performing or interacting with the I/O (correct me if I have misunderstood).

I am assuming that an OpenMP barrier is too heavyweight (as well as unsuitable) for your purposes. For an ersatz co-routine, consider something like the following untested code:

MODULE MOD_COROUTINE
  use omp_lib
  integer(KIND=OMP_LOCK_KIND) :: COROUTINE_LOCK
  integer :: COROUTINE_STATE   ! 0 = idle, 1..N = next stage to run
CONTAINS
  SUBROUTINE COROUTINE_INIT()
    CALL OMP_INIT_LOCK(COROUTINE_LOCK)
    COROUTINE_STATE = 0
  END SUBROUTINE COROUTINE_INIT

  SUBROUTINE COROUTINE_FINI()
    ! drain any remaining stages before destroying the lock
    DO WHILE(COROUTINE_STATE .NE. 0)
      CALL COROUTINE()
    END DO
    CALL OMP_DESTROY_LOCK(COROUTINE_LOCK)
  END SUBROUTINE COROUTINE_FINI

  SUBROUTINE COROUTINE()
    ! if another thread is already advancing the state, return immediately
    IF(.NOT.OMP_TEST_LOCK(COROUTINE_LOCK)) RETURN
    COROUTINE_STATE = COROUTINE_STATE + 1
    SELECT CASE (COROUTINE_STATE)
    CASE(1)
      ...          ! first stage of the co-routine body
    CASE(2)
      ...
    ...
    CASE(N)
      ...          ! final stage: reset so the sequence can be reused
      COROUTINE_STATE = 0
    END SELECT
    CALL OMP_UNSET_LOCK(COROUTINE_LOCK)
  END SUBROUTINE COROUTINE
END MODULE MOD_COROUTINE
...
PROGRAM YOUR_PROGRAM
  USE MOD_COROUTINE
  ...
  CALL COROUTINE_INIT()
  ...
  ! ANY THREAD INSIDE OR OUTSIDE PARALLEL REGION CAN ISSUE
  ! AT CONVIENENT PLACE
  CALL COROUTINE()
  ...
  CALL COROUTINE_FINI()
END PROGRAM YOUR_PROGRAM

The above permits all threads to participate, with any one thread advancing the COROUTINE state at a time, and with little overhead for the threads not advancing the state.

What the above does not do is return the CPU time of a suspended thread to other work when the suspension occurs for I/O.

You haven't completely described your program's requirements, so the suggestions made here do not take all of them into consideration.

Jim Dempsey


jimdempseyatthecove wrote:

Untied task are, well, untied aren't they?

Yes, you are right, untied means untied, but does untied also mean "migrating during task execution"? I had expected that migration might only take place at task scheduling points. The OpenMP standard says (v4.5, page 85, line 24f):

If the untied clause is present on a task construct, any thread in the team can resume the task region after a suspension.

Here, I would expect that a suspension is only due to taskyield. Or is there any other case in which the task is suspended and/or migrated?

The application is a CFD solver running on large compute clusters, which means heavy number crunching interrupted by MPI communication with remote nodes. So for me it is OK if the cache gets flushed when an MPI message is pending for a long time, as long as the idle time can be used for different number crunching. Usually the computational domain is so large that the cache will be cold, maybe warm, for a different compute kernel, but it is critical that it stays hot during kernel execution. Here, preemptive interruptions would certainly degrade performance; most MPI libraries do not use progress threads by default for that reason, too. We only use oversubscribed threads for really slow asynchronous file-system I/O, to keep the simulation running while a result file is written.

Since the code base is large, it is not easy to establish a multitasking system that offers dynamic load balancing in shared memory. Load balancing is generally not easy for CFD, since the load balancer usually only accounts for equal shares of the simulated fluid volume, while some boundary conditions create large workloads and a lot of collective communication is required to compute integral data. Common hybrid parallelism using OpenMP worksharing for the loop nests usually suffers from workshares that are too small for modern manycore processors (false sharing might be a problem as well). Furthermore, many loop nests have to be adapted, more severe data races can occur, and there might be a lot of serial code outside the loop nests, resulting in worse performance according to Amdahl's law.

You are definitely right that your proposed coroutines using a state machine would be more efficient, but unfortunately it is impossible to rewrite the whole code in such a fashion. The code is C, so I do not have Fortran-style modules that would let me encapsulate the coroutines' data. My idea was therefore to put all functions into OpenMP tasks with correctly set dependencies. These functions could even generate more tasks if their workload is too high. Assigning sub-teams is difficult, since I would have to estimate the number of threads to assign to a task; I think the taskloop construct could help to distribute the most expensive loop nests, even though the overhead is probably higher than for standard loop worksharing. If communication is required, it is issued in a non-blocking way, and the synchronization point does not call MPI_Wait() but MPI_Test() followed by a taskyield until the communication has completed. Meanwhile a different function can be executed, and the task scheduler can re-enter the waiting function later on. Inside a NUMA domain, this approach should be able to dynamically process the tasks that are ready to execute, hopefully resulting in less idle time.

The good thing is that I do not have to split my functions into multiple parts to be able to check whether the MPI communication has finished. The task thus actually constitutes a coroutine, which allows a straightforward implementation: the developer does not have to care about overlapping communication and computation as long as there are deferred tasks in the queue. This improves code maintainability a lot!

I am happy that yielding generally seems to work with the most recent compilers. However, I am a little confused that the minimal example I referenced does not yield when using tied tasks. Furthermore, task stealing might be critical for my application, since the code will generate strongly inhomogeneous workloads inside the tasks. So hopefully this feature is coming in the future!


C vs. Fortran...

The language choice is immaterial for any competent programmer (consider it sketch/pseudo-code).

A problem you have (to overcome) is that

#pragma omp task

and

#pragma omp for

(even with num_threads(n))

can conflict in their use of the thread pool. I haven't tested it, but maybe dynamic scheduling would help. It would be nice if there were

#pragma omp for num_threads(omp_get_available_threads())

*** hypothetical function above

RE: taskyield

A problem you have is that taskyield yields to a task in the current region. Nothing is inherently wrong with this, except: suppose the region has tasks A, B, C, D, E, F, G (7 tasks) and there are 4 threads, so any 4 of the 7 tasks could be running. Now assume one of them issues a taskyield, starting one of the 3 pending tasks. All good so far. But while that task sits in its taskyield polling loop, a second of the now-running tasks also enters a taskyield. When this occurs, you may end up with the yielding tasks yielding to each other, IOW no productive work being done.

A curious question arises from reading openmp-4.5.pdf, section 3.3.4 (omp_set_lock), Effect, lines 2 and 3:

Each of these routines has an effect equivalent to suspension of the task executing the routine until the specified lock is available.

It is likely an implementation issue whether the "suspension" is implemented via a spin-wait or interacts with the OpenMP task scheduler. If the latter, you might be able to construct a section of code to substitute for taskyield that produces the desired effect (i.e. eliminates the ping-pong of yielding tasks).

Jim Dempsey
