Software Archive
Read-only legacy content

execution time difference

TK
Beginner
723 Views

I have my functions as follows:

[cpp]
double _Cilk_shared a(...);

double aa(...);

void _Cilk_shared callA()
{
    int i, j;

#pragma omp parallel for private(i,j)
    for (j = 0; ...)
        for (i = 0; ...)
            data2[...] = a();
}
[/cpp]

The only difference between "a" and "aa" is that "a" has _Cilk_shared. Then I run two cases as follows:

[cpp]
time1 = omp_get_wtime();

#pragma omp parallel for private(i,j)
for (j = 0; ...)
    for (i = 0; ...)
        data[...] = aa(...);

timeDiff = omp_get_wtime() - time1;

...

time1 = omp_get_wtime();
_Cilk_offload callA();
timeDiff = omp_get_wtime() - time1;
[/cpp]

Why does my Cilk-offloaded execution take so much longer than the version executed on the host? Am I measuring the elapsed times correctly? Thanks.

0 Kudos
13 Replies
TimP
Honored Contributor III

If I understand your example to show that you are comparing the time taken to copy an array host to host vs. host to coprocessor to host, there's little doubt the latter will take longer.

In some circumstances, there is a delay of KMP_BLOCKTIME (default 0.200 seconds) while a thread is held by OpenMP before Cilk(tm) Plus can take it over.  This may not be one of those cases.

TK
Beginner

Hi Tim, I have a struct something like this:

[cpp]
typedef struct {
    int a[...];
    double ****b;
    ...
} _Cilk_shared aStruct;
[/cpp]

Most of the variables of the struct, especially the large arrays, are used in both "a" and "aa" in exactly the same manner. Also, "data" is a host array, whereas "data2" is an array declared with _Cilk_shared. What else could be the reason? Thanks.

Frances_R_Intel
Employee

But the function aa is being executed on the host, not on the coprocessor, right? If that is true, then even if the array is in a _Cilk_shared structure, the data is not being moved from the host to the coprocessor when the function aa is called. The data is synced between host and coprocessor only when an offload region is entered or exited.

TK
Beginner

Yes, the function "aa" is executed on the host. I am measuring the execution times on the host and on the coprocessor. The following is executed on the host:

[cpp]
time1 = omp_get_wtime();

#pragma omp parallel for private(j,i)
for (j = 0; ...)
    for (i = 0; ...)
        data[...] = aa(...);

timeElap_Host = omp_get_wtime() - time1;
[/cpp]

And the following is executed on the coprocessor:

[cpp]
time1 = omp_get_wtime();
_Cilk_offload callA(...);
timeElap_Co = omp_get_wtime() - time1;
[/cpp]

I was just wondering why timeElap_Co is so high compared to timeElap_Host. On the host I am running with 6 threads, and on the coprocessor with 236. Thanks.

robert-reed
Valued Contributor II

Given your results, I must guess that the contents of a()/aa() are either computationally trivial or heavily contended across threads, so the time you're measuring consists mostly of the copy time for moving "aStruct" to the coprocessor and for copying the data2 image from the coprocessor back to the host, plus whatever setup/overhead time is incurred for the OpenMP parallel loop.  Yes, data2 is declared as shared, but the _Cilk_offload call is synchronous, and the timers wrapped around the call include the time to resync the host copy of data2.  And if my assumption that a()/aa() are computationally trivial is true, most of the remaining time that isn't spent copying between processor and coprocessor is spent setting up threads (a more substantial task on the coprocessor than on the processor).

Alternatively, we have a parallel construct over a function (a/aa) operating on an aStruct that contains large arrays, with no indication of any thread-safety issues in the operations upon that struct.  If there is contention or locking, the costs over 6 threads will be a lot lower than the costs over 236 threads.  There's also no indication whether a/aa employ vectorizable operations.  Bottom line: the core chosen for the current Intel Xeon Phi implementation was picked for its power efficiency, not its speed, so vectorization is essential to exploit the performance features of the coprocessor.  Ultimately, we don't know enough about your code to give a complete explanation of the differences, but it's likely that your test does not contain enough parallel work to amortize the cost of moving the computation to the coprocessor.

TK
Beginner

Is there a way I can measure the time of copying things to/from coprocessor? Thanks.

robert-reed
Valued Contributor II

If you set the environment variable OFFLOAD_REPORT to "3" you should get some data about transfer sizes and times (from the compiler reference):

Controls printing offload execution time, in seconds, and the amount of data transferred, in bytes. This environment variable is equivalent to using the __Offload_report API.

Supported values:

  • 1: Produces a report about time taken.

  • 2: In addition to the information produced at value 1, adds the amount of data transferred between the CPU and the coprocessor.

  • 3: In addition to the information produced at value 2, gives additional details on offload activity, including device initialization, and individual variable transfers.
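For example (assuming a bash shell and that the offload binary is named ./a.out; both names are assumptions, not from the thread), the variable must be set in the environment before the program starts:

```shell
# Assumed setup: bash, offload binary named ./a.out (substitute your own).
# OFFLOAD_REPORT must be in the environment before the program launches.
export OFFLOAD_REPORT=3
echo "OFFLOAD_REPORT=$OFFLOAD_REPORT"   # sanity-check the setting
# ./a.out                               # then run the offload program
```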

robert-reed
Valued Contributor II

Oops, I forgot: OFFLOAD_REPORT only provides information about explicit offloads, not Virtual Shared Memory-based offloads.  Someone better versed in the compiler than I am reminded me of this, but did not indicate whether there is a way to obtain similar information about VSM transfers.  About all I can think of as an alternative is to time the interval of the offload call on the host and to time as much of the offloaded function on the coprocessor as you can.  That will at least provide a bound separating coprocessor execution time from data transfer/offload-setup time, though it won't tell you how to apportion the latter into downloads and uploads.

TK
Beginner

I tried OFFLOAD_REPORT=3; it gives me the following:

[Offload] [HOST] [State] MYO shared mallocSharedMalloc 80008

[Offload] [HOST] [State] MYO shared mallocSharedMalloc 8

[Offload] [HOST] [State] MYO release

[Offload] [HOST] [State] MYO acquire

[Offload] [HOST] [State] MYO shared freeSharedFree

What does the above mean? And is there any other way I could reliably see the (Virtual Shared Memory) transfer and computation times separately? Thanks.

robert-reed
Valued Contributor II

The meaning of these reports seems pretty straightforward: they are all reported from the host side.  The first two lines are allocations from the VSM buffer (a region of common virtual addresses reserved on both host and coprocessor for sharing data); I'm guessing those sizes are in bytes.  Next comes your _Cilk_offload call: before it, the host releases its block for modification on the coprocessor (presumably associated with the data copy to the coprocessor), and afterwards the host reacquires the block with the acquire statement (also presumably associated with the data transfer back to the host).  Finally, when your program ends, the block reserved in the VSM buffer is freed.

Unfortunately, as mentioned above, the detailed reports on offload data transfers I was hoping for are only produced for explicit offloads (e.g., #pragma offload target(mic:0) in(aStruct) out(data2), or something like that).  To get reports on the actual data flow, you could convert this VSM-based offload example into one that uses such explicit directives; OFFLOAD_REPORT then generates more detailed output, an example of which is:

[FILE] filename.c

[LINE] 145

[CPU-TIME] 0.52

[CPU->MIC DATA] 0

[MIC-TIME] 0.000151

[MIC->CPU DATA] 4

If switching from your VSM-based example is not possible, about all I can offer at the moment is the idea from my previous post: measure the "outside and inside" offload execution times by timing the duration of the offload on the host and also timing the duration of the offloaded function on the coprocessor.  The difference between those two numbers is an outer bound on the sum of offload setup time and data transfer time; not very precise, but probably the best we can get for VSM/MYO-based data transfers at this time.

TK
Beginner

Doing "#pragma offload target(mic) in(aStruct) out(data2)" did not work. Ok, thanks a lot!

robert-reed
Valued Contributor II

That was intended as a guide, not as a complete solution.  You'll need to specify offload attributes for the affected functions, types, and data structures.  You might need to consult a tutorial on explicit offload and do a little work to get the code working.  I can't tell whether there was a touch of sarcasm in your thank-you, but I think exploring an explicit-offload form of this sample will at least get you a complete OFFLOAD_REPORT, from which you'll have a better idea of where the time is going.
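A rough sketch of the kind of decoration required, assuming the classic Intel Compiler explicit-offload syntax.  The attribute and pragma lines are Intel-specific, so they are guarded to compile away on other compilers; a_explicit and run_offload are illustrative names, not from the thread, and the array length is a placeholder:

```cpp
#include <cstddef>

// Intel-only offload decorations; they compile away elsewhere.
#ifdef __INTEL_OFFLOAD
#define MIC_FN __attribute__((target(mic)))
#else
#define MIC_FN
#endif

// Any function called inside the offload region needs the target attribute.
MIC_FN double a_explicit(double x) { return x * x; }

double run_offload(double *data2, std::size_t n) {
    // With the Intel compiler, this pragma copies data2 to the card and
    // back, and OFFLOAD_REPORT=3 then shows per-variable transfer detail.
#ifdef __INTEL_OFFLOAD
#pragma offload target(mic) inout(data2 : length(n))
#endif
    for (std::size_t i = 0; i < n; ++i)
        data2[i] = a_explicit(data2[i]);
    return data2[0];
}
```

Because the pointer is named explicitly in the inout clause with a length, the compiler knows exactly what to transfer, which is what makes the per-variable OFFLOAD_REPORT output possible.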

TK
Beginner

Sorry, my bad. Thanks!
