Huang_C_1 · ‎02-25-2016

Or is there a technique to let the CPU and the MIC do their own calculations at the same time while communicating with each other periodically?

How can I use only the master thread, and suspend the other threads on the CPU and the MIC, during the periodic communication?

If it can be done, how is it implemented using C++ and OpenMP?

Kevin_D_Intel · ‎03-09-2016

I don't believe this is doable with offload language extensions and OpenMP. I'll inquire w/Developers.

Huang_C_1 · ‎03-09-2016

Kevin Davis (Intel) wrote:

I don't believe this is doable with offload language extensions and OpenMP. I'll inquire w/Developers.

Actually, I have found a very bad solution to this problem: busy waiting. My code is:

// use two while loops to busy-wait on the MIC and on the CPU

__attribute__((target(mic)))
void DoorToNewWorld(const int& tid, const int& nthreads)//use in openmp parallel block
{
    if(*current_mission % nthreads == tid)
    {
        killer[*current_mission] = tid;
        while(*current_mission != *next_mission);
        ++(*next_mission);
    }
#pragma omp barrier
}

void Job()
{
#pragma offload target(mic:1) signal(&one)  \
    in     (blocks :   length(500)         REUSE      align(512))  \
    in     (killer :   length(500)         REUSE      align(512))  \
    in     (current_mission  :   length(1)         REUSE)  \
    in     (next_mission     :   length(1)         REUSE)  \
    in     (mic_finish_flag  :   length(1)         REUSE)
    {
#pragma omp parallel num_threads(224)
        {
            int tid = omp_get_thread_num();
            int nthreads = omp_get_num_threads();

            for(int i = 0; i < 500; i++)
            {
#pragma omp critical
                blocks += tid;

#pragma omp single
                (*mic_finish_flag)++;
#pragma omp barrier
                DoorToNewWorld(tid, nthreads);
            }
        }
    }

#pragma omp parallel num_threads(24)
    {
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        for(int i = 0; i < 500; i++)
        {
            if(i % nthreads == tid)
            {
                while(true)
                {
#pragma offload_transfer target(mic:1)  \
    out    (mic_finish_flag  :   length(1)         REUSE)
                    if(*mic_finish_flag == i + 1)//mic has finished
                        break;
                }
                cpu_killer = tid;
            }
#pragma omp barrier

            if(i % nthreads == tid)
            {
#pragma offload_transfer target(mic:1)  \
    out    (blocks     :     length(i + 1)     REUSE)   \
    out    (killer     :     length(i + 1)     REUSE)

                printf("In iteration %d, the block value is %d, the killer is %d, cpu_killer is %d\n", i, blocks, killer, cpu_killer);
                fflush(stdout);

                blocks = 0;
                killer = 0;

#pragma offload_transfer target(mic:1)  \
    in     (blocks     :     length(i + 1)     REUSE)   \
    in     (killer     :     length(i + 1)     REUSE)
                *current_mission = i + 1;
#pragma offload_transfer target(mic:1)  \
    in     (current_mission  :     length(1)     REUSE)
            }
#pragma omp barrier
        }
    }

#pragma offload_transfer target(mic:1) wait(&one)
}

Sometimes it works, but it often crashes. It is really not a good solution.

Kevin_D_Intel · ‎03-18-2016

Sorry about the delayed reply. The Developers informed me there is no mechanism for the host and coprocessor to run and exchange messages: all messages are sent or received from the host, and the target cannot communicate with the host in the middle of an offload.

Maybe others have ideas they can share.

The solution seems error prone because you are transferring data in and out at lines 61-79 without regard for the signal (&one) from the offload at line 15. The mic_finish_flag alone does not seem to be a sufficient indication that the initial offload has reached a state where those variables may be reused. That indication comes from the underlying signal associated with &one. Maybe you need to combine your mic_finish_flag with the offload_wait pragma (described here). Using the if() clause, the offload_wait can be made to wait for the signal or to skip the wait.

jimdempseyatthecove · ‎03-21-2016

Unless your application is terminating, you should not terminate the threads initiated on the MIC by the same process on the Host. The MIC has a default upper limit (per process/thread on the Host) of 512 threads. Therefore, you do not want to spawn and kill threads (or let them run to termination). Instead, have your process instantiate thread pools (Host, MIC0, MIC1, ...), and then reuse them for the life of the application.

Consider using, for each MIC, and optionally for the additional threads on the Host, a bulletin-board style messaging system. The Host application initiates, once only, a listener on each MIC and optionally on the host itself. This listener can be initiated either by an OpenMP thread or by a separate host thread (the choice is up to the programmer). The listeners then run in a loop waiting for instructions. Job dispatch is performed between Host and MIC (and optionally the Host itself) using two bulletin boards per server. One board is written by the host to publish jobs to be performed (by that listener); the other board is written by the listener to inform the host of progress (not started, started, finished, error, ...). The listener does not write to the job board, and the host does not write to the job status board. Essentially, the bulletin boards serve as a Single-Producer/Single-Consumer system. It is up to the programmer to decide what constitutes a posting.
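A minimal sketch of the two-board idea, under loud assumptions: a plain std::thread stands in for the listener, both boards live in ordinary shared memory, and every name here (JobBoard, StatusBoard, listener, run_jobs) is hypothetical. It models only the single-producer/single-consumer protocol; it does not show how the boards would be moved between Host and MIC.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical progress values the listener posts for the host.
enum class Status : int { NotStarted, Started, Finished };

// Job board: written only by the host, read only by the listener.
struct JobBoard {
    std::atomic<int> posted{0};        // sequence number of newest job (0 = none)
    int payload = 0;                   // job data, published by the store to `posted`
    std::atomic<bool> shutdown{false};
};

// Status board: written only by the listener, read only by the host.
struct StatusBoard {
    std::atomic<int> completed{0};     // sequence number of last finished job
    std::atomic<Status> state{Status::NotStarted};
    int result = 0;                    // published by the store to `completed`
};

// Started once; loops for the life of the application waiting for postings.
void listener(JobBoard& jobs, StatusBoard& status) {
    int seen = 0;
    for (;;) {
        int seq = jobs.posted.load(std::memory_order_acquire);
        if (seq == seen) {             // nothing new on the job board
            if (jobs.shutdown.load(std::memory_order_acquire)) return;
            std::this_thread::yield();
            continue;
        }
        status.state.store(Status::Started, std::memory_order_relaxed);
        status.result = jobs.payload * jobs.payload;  // stand-in for real work
        status.state.store(Status::Finished, std::memory_order_relaxed);
        status.completed.store(seq, std::memory_order_release);
        seen = seq;
    }
}

// Host side: post one job at a time and poll the status board for completion.
std::vector<int> run_jobs(const std::vector<int>& inputs) {
    JobBoard jobs;
    StatusBoard status;
    std::thread worker(listener, std::ref(jobs), std::ref(status));
    std::vector<int> results;
    int seq = 0;
    for (int x : inputs) {
        jobs.payload = x;                                   // write the posting
        jobs.posted.store(++seq, std::memory_order_release); // then publish it
        while (status.completed.load(std::memory_order_acquire) != seq)
            std::this_thread::yield();
        assert(status.state.load() == Status::Finished);
        results.push_back(status.result);
    }
    jobs.shutdown.store(true, std::memory_order_release);
    worker.join();
    return results;
}
```

Calling run_jobs with a list of inputs returns one squared value per input, in order. Because each board has exactly one writer, a sequence number plus release/acquire ordering is enough here and no lock is needed; a real posting would carry whatever the programmer decides a job is.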

Jim Dempsey

Huang_C_1 · ‎03-22-2016

Hi Kevin Davis, I have fixed the bug in the busy-wait model and got a small improvement in performance, but a large increase in instability: even with a completely correct algorithm, the program sometimes deadlocks.

Actually, this method is somewhat "harmful" to a MIC node. It easily brings a node down. Maybe the high-frequency data transfers between the MIC and the CPU are painful for some hardware or software >_< . "I am so sorry for the nodes that went down and for the engineers who did not know why 'so many nodes are down these days'." But after I stopped struggling with this, the cluster had no nodes down for many days [doge : )], which proves that I am the guilty guy :). Maybe there will be a way to do this job.

And thanks to Jim Dempsey for the note :). I will try to implement it.

jimdempseyatthecove · ‎03-23-2016

FWIW, I discovered the 512 limit with a similar situation.

The main application is written in C# and the bulk of the computation is performed in Fortran (.dll); sandwiched in between is a C++ .dll. The C# application, prior to discovery of the 512 limitation, would create a job list, then would (inefficiently) spawn thread picker tasks, one per thread to compute on the host, and one to offload and wait to compute on the MIC. Each thread would then compete to pick jobs out of the job list. When the job list emptied, the C# threads would terminate. Note that, though the Host could be thought of as terribly oversubscribed, as implemented it was not, because the threads managing the MIC offloads were in a wait state almost all of the time (they would perform a little work on the host to write results to files, then pick another job).

The test runs, as configured, only ran one job list; therefore, no more than 59, or 118, or 177 (host) MIC offload management threads were created per run of the process. IOW, the tests seemed to run fine. In practice, though, the application would launch, then run more than one job list in a sitting. We normally would not run 4 threads per core on the MIC. Only in later tests, during about the 5th job list, would the program die. Sometimes the MIC would die and have to be stopped and restarted. Because the offloads were performed in a .dll in a forms application, we had no console output to see the diagnostic error message about the 512 thread limit. Note that this is not a limit on concurrent threads; rather, it counts the different host thread IDs used for offloads. It is not clear whether this is per host process or in total.

We only discovered this by adding a console window to the C# forms app. Once discovered, we changed the algorithm on the C# side such that the MIC offload management threads, once created, would retire to a pool as opposed to terminate. Then on subsequent job lists, offload management threads would be resumed as opposed to created.

Lesson learned: use a thread pool for your offload management. Do not terminate and start new threads for offloading.
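A rough C++ sketch of that lesson, with the caveat that the application described above manages its threads in C#, so this is an illustration of the pattern and not the actual code. OffloadPool and distinct_threads_used are hypothetical names, and the jobs are placeholders where a real manager thread would issue an offload and wait for it.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <set>
#include <thread>
#include <vector>

// Hypothetical pool: manager threads are created once and reused for every job list.
class OffloadPool {
public:
    explicit OffloadPool(unsigned n) {
        for (unsigned i = 0; i < n; ++i)
            workers_.emplace_back([this] { worker_loop(); });
    }
    ~OffloadPool() {
        {
            std::lock_guard<std::mutex> lock(m_);
            stopping_ = true;
        }
        wake_.notify_all();
        for (auto& t : workers_) t.join();
    }
    void submit(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> lock(m_);
            jobs_.push(std::move(job));
            ++pending_;
        }
        wake_.notify_one();
    }
    // Block until every submitted job has finished (the end of one job list).
    void wait_idle() {
        std::unique_lock<std::mutex> lock(m_);
        idle_.wait(lock, [this] { return pending_ == 0; });
    }
private:
    void worker_loop() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(m_);
                wake_.wait(lock, [this] { return stopping_ || !jobs_.empty(); });
                if (jobs_.empty()) return;      // stopping and nothing left to run
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();                              // a real job would offload and wait here
            std::lock_guard<std::mutex> lock(m_);
            if (--pending_ == 0) idle_.notify_all();
        }
    }
    std::mutex m_;
    std::condition_variable wake_, idle_;
    std::queue<std::function<void()>> jobs_;
    std::vector<std::thread> workers_;
    std::size_t pending_ = 0;
    bool stopping_ = false;
};

// Runs several job lists on one pool and counts the distinct thread IDs used.
std::size_t distinct_threads_used(unsigned pool_size, int job_lists, int jobs_per_list) {
    std::set<std::thread::id> ids;
    std::mutex ids_mutex;
    OffloadPool pool(pool_size);
    for (int l = 0; l < job_lists; ++l) {
        for (int j = 0; j < jobs_per_list; ++j)
            pool.submit([&] {
                std::lock_guard<std::mutex> lock(ids_mutex);
                ids.insert(std::this_thread::get_id());
            });
        pool.wait_idle();                       // threads retire to the pool, not exit
    }
    assert(ids.size() <= pool_size);            // no new thread IDs across job lists
    return ids.size();
}
```

However many job lists are run, the number of distinct thread IDs never exceeds the pool size, which is the property that keeps the count of host thread IDs used for offloads bounded.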

The above programming model was set up for use with KNC coprocessor(s) in offload mode. Because the computations are mostly scalar, it is not an efficient setup. We will be porting the application to KNL, but then it will be host based as opposed to offload based.

Jim Dempsey

[Offload mode] How can MIC and CPU run at the same time and exchange data without join threads on MIC