
Xeon Phi - OpenMP Teams

Jan_K_
Beginner

Hi. I have a problem. I am writing an application for the Intel Xeon Phi (61 cores) that does a stencil calculation on a 2D matrix (five-point stencil). I would like to use OpenMP 4.0 teams, creating teams of 4 threads on each core, for example Team 1 = threads 1,2,3,4, Team 2 = threads 5,6,7,8, etc., because I am trying to reduce cache misses by having each group of 4 threads work out of the same L2 cache. I tried to achieve this by setting KMP_AFFINITY="proclist=[1,5,9,...,237,2,3,4,6,7,8,10,11,12,...,238,239,240],explicit". This affinity works for a small number of teams. Is there any way to set an affinity that solves my problem?
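For concreteness, the grouping I am after can be generated mechanically; below is a minimal sketch (illustration only, assuming the usual KNC numbering where logical CPUs 4c+1 .. 4c+4 share physical core c) that just prints such a proclist:

#include <stdio.h>

/* Prints an explicit KMP_AFFINITY proclist that keeps the 4 hardware
   threads of each core together (assumes logical CPUs 4c+1 .. 4c+4
   share physical core c, as on KNC). */
int main(void)
{
	int cores = 60;  /* cores to use for computation */
	printf("proclist=[");
	for (int c = 0; c < cores; c++)
		for (int t = 1; t <= 4; t++)
			printf("%d%s", 4 * c + t,
			       (c == cores - 1 && t == 4) ? "" : ",");
	printf("],explicit\n");
	return 0;
}

Thanks.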

jimdempseyatthecove
Honored Contributor III

Check out the series of blogs on IDZ that I wrote on this subject:

https://software.intel.com/en-us/search/site/language/en?query=Chronicles

This is a 5-part blog. It is beneficial to read, or at least scan over, all 5 parts.

The blogs' code was run on a 5110P with 60 cores.

Jim Dempsey

Jan_K_
Beginner

OK, I will read it. I have another question about teams.

#include <stdio.h>
#include <omp.h>

int main()
{
	/* First offload region: each team prints its number. */
	#pragma omp target
	{
		#pragma omp teams num_teams(60) num_threads(4)
		{
			printf("Teams: %d, Target 1\n", omp_get_team_num());
		}
	}

	/* Second offload region: identical, but labeled Target 2. */
	#pragma omp target
	{
		#pragma omp teams num_teams(60) num_threads(4)
		{
			printf("Teams: %d, Target 2\n", omp_get_team_num());
		}
	}

	return 0;
}

In the first target region each team prints its number, but in the second target region only the first team does. Why?


jimdempseyatthecove
Honored Contributor III

Jan,

Could you please post your private message to me here so that others can see it and comment as well.

The only thing that stands out is that your thread work partitioning scheme can produce a higher degree of load imbalance than the traditional omp for loop.

The omp loop divides the universe by the number of workers .AND. takes the remainder of the division.

Each thread then receives the divided value, plus 1 if its worker number is less than the remainder.

This way any disparity between threads is at worst 1 additional iteration.

In your technique, the worst case could potentially have the last thread receiving:

universe / number of workers + (number of workers - 1) iterations.

IOW you are lumping the computational burden of the remainder onto the last worker.
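For reference, a minimal sketch of that balanced split (hypothetical helper names, not the runtime's actual code):

#include <stddef.h>

/* Balanced split: every worker gets universe/workers iterations, and
   the first (universe % workers) workers get one extra, so the
   disparity between any two workers is at most one iteration. */
void split_range(size_t universe, size_t workers, size_t id,
                 size_t *begin, size_t *end)
{
	size_t chunk = universe / workers;
	size_t rem   = universe % workers;
	*begin = id * chunk + (id < rem ? id : rem);
	*end   = *begin + chunk + (id < rem ? 1 : 0);
}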

Jim Dempsey

Jan_K_
Beginner

OK, thanks. I understand what is wrong. What about my previous post?

jimdempseyatthecove
Honored Contributor III

On your #3 (the teams printf question), I do not know the reason.

What happens when you explicitly state the target is mic0?

What happens when you explicitly state the target is mic0 .AND. state num_teams(30) for both offloads?
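In OpenMP 4.0 terms that would look something like the sketch below (assuming mic0 is device 0 on your system):

#include <stdio.h>
#include <omp.h>

int main(void)
{
	/* Offload explicitly to the first coprocessor and halve the
	   team count, per the suggestion above. */
	#pragma omp target device(0)
	{
		#pragma omp teams num_teams(30)
		{
			printf("Teams: %d\n", omp_get_team_num());
		}
	}
	return 0;
}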

Jim Dempsey


Ravi_N_Intel
Employee
I cannot reproduce your problem. Here is a section of the output from my run:

.......
Teams: 55, Target 0
Teams: 15, Target 0
Teams: 35, Target 0
Teams: 42, Target 0
Teams: 31, Target 0
Teams: 44, Target 0
Teams: 46, Target 0
Teams: 43, Target 0
Teams: 47, Target 0
Teams: 59, Target 0
Teams: 0, Target 1
Teams: 16, Target 1
Teams: 8, Target 1
Teams: 48, Target 1
Teams: 32, Target 1
Teams: 1, Target 1
Teams: 4, Target 1
Teams: 2, Target 1
Teams: 12, Target 1
Teams: 3, Target 1
Teams: 18, Target 1
Teams: 9, Target 1
.........
Jan_K_
Beginner

That's strange. I ran this code 10 times and got the same effect as before.

Ron_Green
Moderator

Jan,

What compiler version are you using? OMP 4 support was rolled into the compilers over time; with the latest 15.0 it is complete, except for user-defined reductions.

Is it possible you have an older (more than 6 months old) compiler? For OMP 4 you'll want 15.0.
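If in doubt, the predefined macros will tell you what you are compiling with; a minimal check:

#include <stdio.h>

int main(void)
{
#ifdef __INTEL_COMPILER
	printf("Intel compiler: %d\n", __INTEL_COMPILER);  /* 1400 = 14.0, 1500 = 15.0 */
#endif
#ifdef _OPENMP
	printf("_OPENMP: %d\n", _OPENMP);  /* 201307 corresponds to OpenMP 4.0 */
#endif
	return 0;
}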

ron

Jan_K_
Beginner

Yes, I have an older compiler: version 14.0.1.106, Build 20131008.
