Hi. I have a problem. I am writing an application for the Intel Xeon Phi (61 cores) that does a stencil calculation on a 2D matrix (five-point stencil). I would like to use OpenMP 4.0 teams and create teams of 4 threads, one team per core — for example, Team 1 = threads 1,2,3,4; Team 2 = threads 5,6,7,8; etc. — because I am trying to reduce cache misses by having each team of 4 threads compute around the same L2 cache. I tried to achieve this by setting KMP_AFFINITY="proclist=[1,5,9,...,237,2,3,4,6,7,8,10,11,12,...,238,239,240],explicit". This affinity works for a small number of teams. Is there any way to set the affinity that solves my problem? Thanks.
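For reference, one way to get this kind of layout without listing every logical CPU is compact affinity with fine granularity (a sketch; the exact mapping depends on the runtime, and the verbose modifier prints the actual binding so it can be verified):

    # Fill each core's 4 hardware threads in order before moving to the
    # next core, so OpenMP threads 0-3 share core 0's L2, threads 4-7
    # share core 1's L2, and so on.  "verbose" prints the real binding.
    export KMP_AFFINITY="granularity=fine,compact,verbose"
    export OMP_NUM_THREADS=240   # e.g. 60 compute cores x 4 HW threads
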
Check out a series of blogs on IDZ that I wrote on this subject
https://software.intel.com/en-us/search/site/language/en?query=Chronicles
This is a 5-part blog. It is beneficial to read, or at least scan over, all 5 parts.
The blogs' code was run on a 5110P with 60 cores.
Jim Dempsey
OK, I will read it. I have another question about teams:

    #include <stdio.h>
    #include <omp.h>

    int main() {
        #pragma omp target
        {
            #pragma omp teams num_teams(60) num_threads(4)
            {
                printf("Teams: %d, Target 1\n", omp_get_team_num());
            }
        }
        #pragma omp target
        {
            #pragma omp teams num_teams(60) num_threads(4)
            {
                printf("Teams: %d, Target 2\n", omp_get_team_num());
            }
        }
        return 0;
    }

In the first target region each team prints its number, but in the second only the first team does. Why?
Jan,
Could you please post your private message to me here such that others can see it, and comment as well.
The only thing that stands out is that your thread work-partitioning scheme can produce a higher degree of load imbalance than the traditional omp for loop.
The omp for loop divides the universe by the number of workers .AND. takes the remainder of the division.
Each thread then receives the divided value, plus 1 if its worker number is less than the remainder.
This way any disparity between threads is at worst 1 additional iteration.
In your technique, the worst case could potentially have the last thread receiving:
universe / number of workers + (number of workers - 1) iterations.
IOW you are lumping the computational burden of the remainder onto the last worker.
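The static partitioning described above can be sketched as follows (static_partition is an illustrative helper, not an OpenMP API — the real runtime does this internally):

    #include <assert.h>

    /* Illustrative sketch: divide the universe (n iterations) by the
     * number of workers, then hand the remainder out one extra
     * iteration each to the first (n % nworkers) workers, so the
     * disparity between any two threads is at most 1 iteration. */
    void static_partition(int n, int nworkers, int tid,
                          int *begin, int *end)
    {
        int chunk = n / nworkers;                /* divided value             */
        int rem   = n % nworkers;                /* remainder of the division */
        *begin = tid * chunk + (tid < rem ? tid : rem);
        *end   = *begin + chunk + (tid < rem ? 1 : 0);  /* half-open range */
    }

With this scheme the worst-off worker does at most one more iteration than the best-off one, instead of the entire remainder (up to nworkers - 1 iterations) landing on the last worker.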
Jim Dempsey
OK, thanks. I understand what was wrong. What about my previous post?
On your #3 I do not know the reason.
What happens when you explicitly state the target is mic0?
What happens when you explicitly state the target is mic0 .AND. state num_teams(30) for both offloads?
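A minimal sketch of what that combined suggestion might look like, assuming the coprocessor is device 0 and using the OpenMP 4.0 device() clause (the guard lets the sketch also build without OpenMP; count_teams_on_device0 is an illustrative name):

    #include <stdio.h>
    #ifdef _OPENMP
    #include <omp.h>
    #endif

    /* Sketch: pin the offload to device 0 explicitly, request 30 teams,
     * and report how many teams the runtime actually created.  Without
     * OpenMP support this falls back to a serial stub returning 1. */
    int count_teams_on_device0(void)
    {
        int nteams = 1;
    #ifdef _OPENMP
        #pragma omp target device(0) map(tofrom: nteams)
        #pragma omp teams num_teams(30)
        {
            if (omp_get_team_num() == 0)
                nteams = omp_get_num_teams();
        }
    #endif
        return nteams;
    }

num_teams() is an upper bound, so comparing the value this returns across the two offloads would show whether the runtime is creating the same league both times.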
Jim Dempsey
It's strange. I ran this code 10 times and got the same effect as before.
Jan,
What compiler version are you using? OpenMP 4.0 was rolled into the compilers over time; with the latest 15.0 it's complete, except for user-defined reductions.
Is it possible you have an older (> 6 months old) compiler? For OpenMP 4.0 you'll want 15.0.
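One quick way to check which OpenMP level a compiler supports is the predefined _OPENMP macro, which encodes the spec date — 201307 corresponds to OpenMP 4.0, 201107 to OpenMP 3.1 (a sketch; reported_openmp_version is an illustrative name):

    #include <stdio.h>

    /* Returns the OpenMP spec date the compiler reports (yyyymm),
     * e.g. 201307 for OpenMP 4.0, or 0 if OpenMP is not enabled. */
    int reported_openmp_version(void)
    {
    #ifdef _OPENMP
        return _OPENMP;
    #else
        return 0;
    #endif
    }
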
ron
Yes, I have an older compiler: version 14.0.1.106, build 20131008.
