Hi. I have a problem. I am writing an application for the Intel Xeon Phi (61 cores) that does a stencil calculation on a 2D matrix (five-point stencil). I would like to use OpenMP 4.0 teams and create teams of 4 threads, one team per core — for example, Team 1 = threads 1,2,3,4; Team 2 = threads 5,6,7,8; etc. — because I am trying to reduce cache misses by having each team of 4 threads compute around the same L2 cache. I tried to achieve this by setting KMP_AFFINITY="proclist=[1,5,9,...,237,2,3,4,6,7,8,10,11,12,...,238,239,240],explicit". This affinity works for a small number of teams. Is there any way to set the affinity that solves my problem? Thanks.
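For reference, one way to get this kind of layout without listing every logical CPU is compact affinity with fine granularity (a sketch; the exact mapping depends on the runtime, and the verbose modifier prints the actual binding so it can be verified):

    # Fill each core's 4 hardware threads in order before moving to the
    # next core, so OpenMP threads 0-3 share core 0's L2, threads 4-7
    # share core 1's L2, and so on.  "verbose" prints the real binding.
    export KMP_AFFINITY="granularity=fine,compact,verbose"
    export OMP_NUM_THREADS=240   # e.g. 60 compute cores x 4 HW threads
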
Check out a series of blogs on IDZ that I wrote on this subject
https://software.intel.com/en-us/search/site/language/en?query=Chronicles
This is a 5-part blog. It is beneficial to read, or at least scan over, all 5 parts.
The blogs' code was run on a 5110P with 60 cores.
Jim Dempsey
OK, I will read it. I have another question about teams:

    #include <stdio.h>
    #include <omp.h>

    int main() {
        #pragma omp target
        {
            #pragma omp teams num_teams(60) num_threads(4)
            {
                printf("Teams: %d, Target 1\n", omp_get_team_num());
            }
        }
        #pragma omp target
        {
            #pragma omp teams num_teams(60) num_threads(4)
            {
                printf("Teams: %d, Target 2\n", omp_get_team_num());
            }
        }
        return 0;
    }

In the first target region each team prints its number, but in the second only the first team does. Why?
Jan,
Could you please post your private message to me here such that others can see it, and comment as well.
The only thing that stands out is that your thread work-partitioning scheme can produce a higher degree of load imbalance than the traditional omp for loop.
The omp for loop divides the universe by the number of workers .AND. takes the remainder of the division.
Each thread then receives the divided value, plus 1 if its worker number is less than the remainder.
This way any disparity between threads is at worst 1 additional iteration.
In your technique, the worst case could potentially have the last thread receiving:
universe / number of workers + (number of workers - 1) iterations.
IOW you are lumping the computational burden of the remainder onto the last worker.
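The static partitioning described above can be sketched as follows (static_partition is an illustrative helper, not an OpenMP API — the real runtime does this internally):

    #include <assert.h>

    /* Illustrative sketch: divide the universe (n iterations) by the
     * number of workers, then hand the remainder out one extra
     * iteration each to the first (n % nworkers) workers, so the
     * disparity between any two threads is at most 1 iteration. */
    void static_partition(int n, int nworkers, int tid,
                          int *begin, int *end)
    {
        int chunk = n / nworkers;                /* divided value             */
        int rem   = n % nworkers;                /* remainder of the division */
        *begin = tid * chunk + (tid < rem ? tid : rem);
        *end   = *begin + chunk + (tid < rem ? 1 : 0);  /* half-open range */
    }

With this scheme the worst-off worker does at most one more iteration than the best-off one, instead of the entire remainder (up to nworkers - 1 iterations) landing on the last worker.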
Jim Dempsey
OK, thanks. I understand what was wrong. What about my previous post?
On your #3 I do not know the reason.
What happens when you explicitly state the target is mic0?
What happens when you explicitly state the target is mic0 .AND. state num_teams(30) for both offloads?
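A minimal sketch of what that combined suggestion might look like, assuming the coprocessor is device 0 and using the OpenMP 4.0 device() clause (the guard lets the sketch also build without OpenMP; count_teams_on_device0 is an illustrative name):

    #include <stdio.h>
    #ifdef _OPENMP
    #include <omp.h>
    #endif

    /* Sketch: pin the offload to device 0 explicitly, request 30 teams,
     * and report how many teams the runtime actually created.  Without
     * OpenMP support this falls back to a serial stub returning 1. */
    int count_teams_on_device0(void)
    {
        int nteams = 1;
    #ifdef _OPENMP
        #pragma omp target device(0) map(tofrom: nteams)
        #pragma omp teams num_teams(30)
        {
            if (omp_get_team_num() == 0)
                nteams = omp_get_num_teams();
        }
    #endif
        return nteams;
    }

num_teams() is an upper bound, so comparing the value this returns across the two offloads would show whether the runtime is creating the same league both times.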
Jim Dempsey
It's strange. I ran this code 10 times and got the same effect as before.
Jan,
What compiler version are you using? OpenMP 4.0 was rolled into the compilers over time; with the latest 15.0 it's complete, except for user-defined reductions.
Is it possible you have an older (> 6 months old) compiler? For OpenMP 4.0 you'll want 15.0.
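One quick way to check which OpenMP level a compiler supports is the predefined _OPENMP macro, which encodes the spec date — 201307 corresponds to OpenMP 4.0, 201107 to OpenMP 3.1 (a sketch; reported_openmp_version is an illustrative name):

    #include <stdio.h>

    /* Returns the OpenMP spec date the compiler reports (yyyymm),
     * e.g. 201307 for OpenMP 4.0, or 0 if OpenMP is not enabled. */
    int reported_openmp_version(void)
    {
    #ifdef _OPENMP
        return _OPENMP;
    #else
        return 0;
    #endif
    }
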
ron
Yes, I have an older compiler: version 14.0.1.106, build 20131008.
