Hi, I am new to Cilk Plus and MIC.
I have a problem where I need to offload some data, process it in parallel on the MIC, append the results to a list B, and return B to the host. The host will then combine this list with another host-side std::list<T> A.
I have the following approach (sketched in code below):
- Create a cilk::reducer_list_append<T> B on the host using _Cilk_shared; B is empty at this point.
- Cilk Plus will automatically transfer B to the MIC; the MIC does some calculations, appends data to B, and returns B to the host.
- B.get_value() will return a std::list<T> C; use A.splice to combine lists A and C.
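Here is roughly what I mean (just a sketch; T and the per-element work are placeholders, and whether the reducer can really live in shared memory like this is exactly what I am unsure about):

#include <list>
#include <cilk/cilk.h>
#include <cilk/reducer_list.h>

typedef int T;  // placeholder element type

_Cilk_shared cilk::reducer_list_append<T> B;  // shared reducer, empty at first

void _Cilk_shared compute(int n) {
    cilk_for (int i = 0; i < n; ++i)
        B.push_back(i);  // stand-in for the real per-element result
}

int main() {
    std::list<T> A;
    _Cilk_offload compute(1000);     // runs on the MIC
    std::list<T> C = B.get_value();  // collect the result list on the host
    A.splice(A.end(), C);            // merge C into A
    return 0;
}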
I read https://software.intel.com/en-us/forums/topic/360604, and it seems that in order to _Cilk_shared an STL container, or another container like cilk::reducer_list_append, I need to specify __offload::shared_allocator<> as the allocator of B.
So my concerns are:
1. Do I need to construct A with the __offload::shared_allocator<> allocator as well? Otherwise C will have a different allocator than A, right?
2. Is there an easier way to solve this problem? I just need to transfer a list back and forth between the host and the MIC.
Thanks in advance
I tested it a bit. It seems that I cannot use __offload::shared_allocator<> as the allocator for cilk::reducer_list_append<T>. So I am going to create a std::list B, shared between the host and the MIC, that uses __offload::shared_allocator<T>.
Is it OK to append data from B to A (A uses the default std::allocator, B uses __offload::shared_allocator)?
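In code, the situation would be something like this (a sketch; it assumes <list> and offload.h were included under the _Cilk_shared pragmas so that the list class exists on both sides):

#include <list>
#include <offload.h>

typedef int T;  // placeholder element type
typedef std::list<T, __offload::shared_allocator<T> > shared_list;

_Cilk_shared shared_list B;  // filled on the MIC side

int main() {
    std::list<T> A;  // host-only list with the default allocator
    // A.splice(A.end(), B) cannot work, because A and B are different
    // list types; "appending" therefore means copying the elements:
    A.insert(A.end(), B.begin(), B.end());
    return 0;
}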
These questions seem to be related to a number of issues which have been posted recently on stackoverflow.com.
My belief is as follows (being more of a Fortran programmer than a C++ programmer, my beliefs can be somewhat off; I expect, and will graciously accept, corrections):
If you wish to use _Cilk_shared lists to pass information between the host and the coprocessor, you need to use __offload::shared_allocator<>.
You should be able to use __offload::shared_allocator<> with cilk::reducer_list_append<T> provided that, in addition to declaring your lists _Cilk_shared, the class itself is also declared _Cilk_shared, for example by including the header files between
#pragma offload_attribute (push, _Cilk_shared)
...
#pragma offload_attribute (pop)
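For instance, something along these lines (my guess at the headers your case would need):

#pragma offload_attribute (push, _Cilk_shared)
#include <list>
#include <cilk/reducer_list.h>
#pragma offload_attribute (pop)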
You might be able to splice a list made with __offload::shared_allocator<> into a list made with the default allocator. However, even if all the entries in the resulting list were in the shared space, the list itself would not be shared, and the first time you called the allocator you would end up with a link from the shared space to outside the shared space, which is a very bad thing.
I might suggest that lists are not necessarily the best way to program the coprocessor. To get optimum performance from the coprocessor, you need the code to both vectorize and parallelize. Also, perhaps _Cilk_shared is not your best solution if the goal is to conglomerate the results from the coprocessor onto the processor.
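For contrast, a sketch of the explicit-offload style with a flat array (the names and sizes are made up; this is only an illustration of keeping the data movement visible, not a drop-in replacement for your reducer):

#include <stdio.h>
#define N 1024

// Compiled for both the host and the coprocessor.
__attribute__((target(mic)))
void compute(float *out, int n) {
    for (int i = 0; i < n; ++i)  // simple, vectorizable loop
        out[i] = 2.0f * i;
}

int main() {
    static float results[N];
    // Run on the MIC; results is copied back when the offload ends.
    #pragma offload target(mic) out(results)
    compute(results, N);
    printf("%f\n", results[N - 1]);
    return 0;
}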
Thanks for your answer. I just realized that I need to use #pragma offload_attribute (push, _Cilk_shared).
I did some experiments and realized that if I include some library headers (like the Cilk headers or iostream) outside of #pragma offload_attribute,
I get a weird runtime error on the MIC: cannot load library ...
iostream does not work at all when I am using #pragma offload_attribute (push, _Cilk_shared), no matter whether I put it inside or outside the offload_attribute pragmas.
You wrote that "_Cilk_shared is not your best solution if the goal is to conglomerate the results from the coprocessor onto the processor". What do you mean by that? I thought the only drawback of using it is that the memory transfer between the host and the MIC will not be optimized; I can still do whatever I want to vectorize and parallelize the code inside a shared function?
I chose Cilk Plus mainly because it provides a nice API (reducers and cilk_for) that shortens my development time. I am a graduate student and I need to finish this project in 3 weeks. I looked at OpenMP, and it seems harder there to write the reductions/locks needed to gather data, and I would very likely end up with performance worse than on the CPU.
Jun,
As I pointed out in https://software.intel.com/en-us/comment/1821481#comment-1821481, the iostream header needs to be included between offload_attribute pragmas, but not in a shared memory region. I know of no way to create iostream objects in shared memory. As always, C++ experts are welcome to correct me if there is such a way. So, for headers like this you would need:
#pragma offload_attribute (push, target(mic))
#include <iostream>
#pragma offload_attribute (pop)
As to my comment that "_Cilk_shared is not your best solution if the goal is to conglomerate the results from the coprocessor onto the processor", I was thinking in terms of being able to move data asynchronously while you do other work. Data in the shared region is only updated at the start and end of the offload sections. If you have large data structures in which only a small subset of the memory locations have changed, _Cilk_shared can be very quick, since it copies only the changed locations. But if you are moving large chunks of data, there isn't really any way to hide it.
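With the explicit offload pragmas, by contrast, you can start a transfer, overlap it with host work, and wait on it later. A sketch (the function, buffer, and signal tag are made-up names, and this is the explicit-copy model rather than _Cilk_shared):

#include <stdio.h>
#define N 1024

__attribute__((target(mic)))
void fill(float *out, int n) {
    for (int i = 0; i < n; ++i)
        out[i] = 0.5f * i;
}

int main() {
    static float buf[N];
    char sig;  // tag identifying this asynchronous offload

    // Control returns to the host as soon as the offload is queued.
    #pragma offload target(mic:0) signal(&sig) out(buf)
    fill(buf, N);

    /* ... other host work can overlap the computation and transfer ... */

    // Block until the offload tagged with &sig has completed.
    #pragma offload_wait target(mic:0) wait(&sig)
    printf("%f\n", buf[N - 1]);
    return 0;
}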
However, if you are using classes, _Cilk_shared is the way to go, and I can't fault your reasons for wanting to use it. For me it would take longer and be more error-prone than using OpenMP, but for those who were raised on C++ and similar languages, Cilk Plus can be a definite advantage.
