Beginner

Xeon Phi - OpenMP Teams

Hi. I have a problem. I am writing an application for the Intel Xeon Phi (61 cores) that does a five-point stencil computation on a 2D matrix. I would like to use OpenMP 4.0 teams and create one team of 4 threads on each core, for example Team 1 = threads 1,2,3,4, Team 2 = threads 5,6,7,8, etc., because I am trying to reduce cache misses by having the 4 threads of a team work around the same L2 cache. I tried to achieve this by setting KMP_AFFINITY="proclist=[1,5,9,...,237,2,3,4,6,7,8,10,11,12,...,238,239,240],explicit". This affinity works for a small number of teams. Is there any way to set an affinity that solves my problem? Thanks.
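
For reference, thread placement can be verified by printing the logical CPU each OpenMP thread lands on. A minimal sketch, assuming a native run on the coprocessor where glibc's sched_getcpu() is available (not part of the original post):

#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <omp.h>

int main()
{
	#pragma omp parallel
	{
		/* Each thread reports the logical CPU it is running on,
		   so the proclist mapping can be checked directly. */
		printf("Thread %3d -> CPU %3d\n", omp_get_thread_num(), sched_getcpu());
	}

	return 0;
}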


Check out a series of blogs on IDZ that I wrote on this subject

https://software.intel.com/en-us/search/site/language/en?query=Chronicles

This is a 5-part blog. It is beneficial to read, or at least scan over, all 5 parts.

The code in the blogs was run on a 5110P with 60 cores.

Jim Dempsey

Beginner

OK. I will read it. I have another question about teams.

#include <stdio.h>
#include <omp.h>

int main()
{
	#pragma omp target
	{
		#pragma omp teams num_teams(60) num_threads(4)
		{
			printf("Teams: %d, Target 1\n", omp_get_team_num());
		}
	}

	#pragma omp target
	{
		#pragma omp teams num_teams(60) num_threads(4)
		{
			printf("Teams: %d, Target 2\n", omp_get_team_num());
		}
	}

	return 0;
}

In the first target region each team prints its number, but in the second target region only the first team does. Why?

 

 


Jan,

Could you please post your private message to me here so that others can see it and comment as well.

The only thing that stands out is that your thread work partitioning scheme can produce a higher degree of load imbalance than the traditional omp for loop.

The omp loop divides the universe (iteration space) by the number of workers .AND. takes the remainder of the division.

Each thread then receives the divided value + 1 if its worker number is less than the remainder.

This way any disparity between threads is at worst 1 additional iteration.

In your technique, the worst case could potentially have the last thread receiving:

universe / number of workers + (number of workers - 1) iterations.

IOW you are lumping the computational burden of the remainder onto the last worker.
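
To make the comparison concrete, here is a minimal sketch of that balanced split (an illustrative helper, not code from the blogs):

#include <stdio.h>

/* Balanced split, as in the usual static "omp for" schedule:
   each worker gets universe / workers iterations, and the first
   (universe % workers) workers get one extra, so the disparity
   between any two workers is at most 1 iteration. */
void balanced_range(int universe, int workers, int id, int *begin, int *end)
{
	int chunk = universe / workers;
	int rem   = universe % workers;

	*begin = id * chunk + (id < rem ? id : rem);
	*end   = *begin + chunk + (id < rem ? 1 : 0);
}

int main()
{
	int universe = 10, workers = 4;
	int id, begin, end;

	for (id = 0; id < workers; id++)
	{
		balanced_range(universe, workers, id, &begin, &end);
		printf("Worker %d: iterations [%d, %d)\n", id, begin, end);
	}

	return 0;
}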

Jim Dempsey

Beginner

Ok. Thanks. I understand what is wrong. What about my previous post?


On your #3 I do not know the reason.

What happens when you explicitly state the target is mic0?

What happens when you explicitly state the target is mic0 .AND. state num_teams(30) for both offloads?
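
In code, that experiment might look roughly like the sketch below (assuming device 0 maps to mic0 on your system; the device clause is OpenMP 4.0, and the second offload would be changed the same way):

#include <stdio.h>
#include <omp.h>

int main()
{
	#pragma omp target device(0)		/* explicitly offload to device 0 (assumed to be mic0) */
	{
		#pragma omp teams num_teams(30)	/* try half the team count, as suggested */
		{
			printf("Teams: %d, Target 1\n", omp_get_team_num());
		}
	}

	return 0;
}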

Jim Dempsey

 

Employee

I cannot reproduce your problem. Here is a section of the output from my run:

.......
Teams: 55, Target 0
Teams: 15, Target 0
Teams: 35, Target 0
Teams: 42, Target 0
Teams: 31, Target 0
Teams: 44, Target 0
Teams: 46, Target 0
Teams: 43, Target 0
Teams: 47, Target 0
Teams: 59, Target 0
Teams: 0, Target 1
Teams: 16, Target 1
Teams: 8, Target 1
Teams: 48, Target 1
Teams: 32, Target 1
Teams: 1, Target 1
Teams: 4, Target 1
Teams: 2, Target 1
Teams: 12, Target 1
Teams: 3, Target 1
Teams: 18, Target 1
Teams: 9, Target 1
.........
Beginner

It's strange. I ran this code 10 times and got the same effect as before.

Moderator

Jan,

What compiler version are you using? OMP 4 was rolled into the compilers over time. With the latest 15.0 it's complete, except for user-defined reductions.

Is it possible you have an older (> 6 months old) compiler? For OMP 4 you'll want 15.0.

ron

Beginner

Yes, I have an older compiler: version 14.0.1.106, Build 20131008.
