Double buffering

Ignac_A_ · ‎03-04-2014

Hi, I'm new here and I have a question right from the start.

What I'm trying to do is double buffering on XEON PHI. This program seems to run fine if I remove signal/wait that is when I'm not doing any asynchronous copying. But if I do (as in the program below) it will throw at me "Segmentation fault". Does anyone know what might be the problem here?
Whole offload report is in txt file as attachment.
Compile:
icc -openmp double_offload.c -o test.mic

While I'm at it. Can the XEON PHI copy Host->Device and Device->Host simultaneously. Since my goal is to make triple-buffering, if it is possible.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <time.h>

// aligment of memory; 64 because of size of cacheline
	#define ALLOC alloc_if(1)
	#define REUSE alloc_if(0)
	#define FREE  free_if(1)
	#define RETAIN free_if(0)
	#define ALIGN 64
	


__attribute__ (( target (mic))) void PHI_OFF(float *input, float *output, float *coeff, int nChunks, int chunk_size) {
	int bl,i;
	printf("nChunks: %d chunk_size: %d\n",nChunks,chunk_size);
	
	#pragma omp parallel for private(bl) shared(input,output,coeff)
	for(bl=0; bl<nChunks; bl++) {
		for(i=0;i<chunk_size;i++){
			output[bl*chunk_size+i]=coeff*input[bl*chunk_size+i];
			//output[bl*chunk_size:chunk_size]=coeff[0:chunk_size]*input[bl*chunk_size:chunk_size];
		}
	}
}


int main()
{
	int nChunks=150000;
	int chunk_size=512;

	int input_size=nChunks*chunk_size;
	int half_input_size=input_size/2;

	int f;

	//allocate memmory
	// this works on both device and host
	float  *input;
	float  *output;
	__attribute__((target(mic))) float  *coeff;
	__attribute__((target(mic))) float  *I1;
	__attribute__((target(mic))) float  *I2;
	__attribute__((target(mic))) float  *S1;
	__attribute__((target(mic))) float  *S2;
	
	input = (float*)_mm_malloc( input_size*sizeof(float) ,ALIGN);//input data on host
	output = (float*)_mm_malloc( input_size*sizeof(float) ,ALIGN);//output data on host
	coeff = (float*)_mm_malloc( chunk_size*sizeof(float) ,ALIGN);//coefficients on host and device
	I1=(float*)_mm_malloc(half_input_size*sizeof(float) ,ALIGN);//half of the input data on device
	I2=(float*)_mm_malloc(half_input_size*sizeof(float) ,ALIGN);//half of the input data on device
	S1=(float*)_mm_malloc(half_input_size*sizeof(float) ,ALIGN);//half of the output data on device
	S2=(float*)_mm_malloc(half_input_size*sizeof(float) ,ALIGN);//half of the output data on device
	
	
	//initialize arrays
	srand (time(NULL));
	for(f=0;f<input_size;f++){
		input=(float) (rand() % 10000 + 1.0)/1000.0;
		output=0;
	}
	for(f=0;f<chunk_size;f++){
		coeff=(float) (rand() % 10000 + 1.0)/1000.0;
	}
	
	
    // Allocate memory on the card
	#pragma offload target(mic:0) \
		nocopy(I1[0:half_input_size]:align(ALIGN) RETAIN ALLOC )\
		nocopy(I2[0:half_input_size]:align(ALIGN) RETAIN ALLOC )\
		nocopy(S1[0:half_input_size]:align(ALIGN) RETAIN ALLOC )\
		nocopy(S2[0:half_input_size]:align(ALIGN) RETAIN ALLOC )\
		nocopy(coeff[0:chunk_size]:align(ALIGN) RETAIN ALLOC )
	{}

	
	//Copy coefficients
	#pragma offload_transfer target(mic:0) in(coeff:length(chunk_size) RETAIN REUSE align(ALIGN)) //in(nChunks) in(chunk_size)
	{}

	//-------------> double buffering 
	
	#pragma offload_transfer target(mic:0) \
		in( input[0:half_input_size]:into(I1[0:half_input_size]) RETAIN REUSE align(ALIGN) ) signal(I1)

	
	#pragma offload_transfer target(mic:0) \
		in( input[half_input_size:half_input_size]:into(I2[0:half_input_size]) RETAIN REUSE align(ALIGN) ) signal(I2)	
		
	#pragma offload target(mic:0) \
		nocopy(coeff:length(chunk_size) RETAIN REUSE) \
		nocopy(I1:length(half_input_size) RETAIN REUSE) \
		out(S1[0:half_input_size]:into(output[0:half_input_size]) RETAIN REUSE align(ALIGN) ) wait(I1)
	{
		PHI_OFF(I1,S1,coeff,nChunks/2,chunk_size);
	}	
	
	#pragma offload target(mic:0) \
		nocopy(coeff:length(chunk_size) RETAIN REUSE) \
		nocopy(I2:length(half_input_size) RETAIN REUSE) \
		out(S2[0:half_input_size]:into(output[half_input_size:half_input_size]) RETAIN REUSE align(ALIGN) ) wait(I2)
	{
		PHI_OFF(I2,S2,coeff,nChunks/2,chunk_size);
	}
	
    // Deallocate memory on the card
	
	#pragma offload target(mic:0) \
		nocopy(I1, I2:length(half_input_size) REUSE FREE)\
		nocopy(S1, S2:length(half_input_size) REUSE FREE)\
		nocopy(coeff:length(chunk_size) REUSE FREE )
	{}

	
	
	
	//free the host system memory
	_mm_free(input);
	_mm_free(output);
	_mm_free(coeff);
	_mm_free(I1);
	_mm_free(I2);
	_mm_free(S1);
	_mm_free(S2);

  return 0;
}

Kevin_D_Intel · ‎03-04-2014

It appears you might be experiencing an already fixed defect either in the compiler/library or MPSS. The program builds and runs "as is" with last three Composer XE 2013 SP1 releases (noted below) under MPSS 3.1.2.

Intel(R) C Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 14.0.0.080 Build 20130728
Intel(R) C Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 14.0.1.106 Build 20131008
Intel(R) C Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 14.0.2.144 Build 20140120

What version of MPSS do you install?
What is your compiler version (icc -V)?

Regarding your question: “Can the XEON PHI copy Host->Device and Device->Host simultaneously. Since my goal is to make triple-buffering, if it is possible.”

Are you asking about using IN/OUT for single offload? Could you give an example outline of what you are interested in doing?

Ignac_A_ · ‎03-04-2014

Hi,

thanks for quick answer. The version of the compiler is this: Version 13.1.1.163 Build 20130313 So it's probably that...

To the simultaneous copy:

My question is probably an uninformed one, I'm sorry for that.

The idea is hide transfer time as much as possible and it should be similar to what is possible on GPU cards that is folloving. There are multiple queues with transfer and computation commands which should result in something like this:

Channel 1: | Transfer IN | Computation | Transfer OUT | ... repeat

Channel 2 null | Transfer IN | Computation | Transfer OUT | ...repeat

Channel 3 null null | Transfer IN | Computation | Transfer OUT | ...repeat

This way the transfer time is hidden as much as possible but it (at least in my opinion) the device must be able to simultaneously transfer IN and OUT.

If you mean by single offload one pragma statement (?) then no. Two pragma but asynchronous would suffice.

Kevin_D_Intel · ‎03-04-2014

I am unable reproduce a seg-fault with that particular 13.1 release under either MPSS 3.1.2 or 2.1 so it is possible in your version there's an accidental freeing of one of the arrays that's causing the seg-fault in one of the subsequent offloads?

No need to apologize for any question, ever.

You can use three channels and perhaps more to help hide the transfer time. I will ask our Developers to weigh in on about a sensible number and how much transfers are really simultaneous underneath the hood.

Rajiv_D_Intel · ‎03-04-2014

The current versions of MPSS (MIC driver) do not support bi-directional concurrent data transfers. Data in and compute or data out and compute can happen concurrently but not data in and out. Therefore the speedup for a nicely balanced workload where the data-in time, data-out time and compute-time are all about the same is 2x and not the ideal 3x.

Concurrent data transfer in and out is being worked on.

Ignac_A_ · ‎04-01-2014

Hi there, I hope it is OK if I resurrect this thread, but I think it is related.
I've encountered strange behaviour regarding double buffering. If I use asynchronous copy in (to device) it seems that it is not asynchronous. The asynchronous copy out (to host) seems to be without problems.

My measurements of the time is like this:

				start = elapsedTime(); // uses gettimeofday()
				TOT_time = start - TOT_start;
				printf("                            ;Total time=%0.3fs\n",TOT_time);
				#pragma offload_transfer target(mic:0) ...
				mark = elapsedTime();
				time = mark - start;
				TOT_time = mark - TOT_start;
				printf("pragma time=%0.3fs     ;Total time=%0.3fs\n",time,TOT_time);

So if is the pragma directive asynchronous it should give 0.0s for execution of that pragma.

In both cases I'm copying about 330MB so it should take about 0.33GB/6.8GB/s=0.048s.

Output from my program is
(copy out)
;Total time=0.000s
Copy in I1 time=0.048s ;Total time=0.048s
;Total time=0.048s
Computation S1 time=0.000s ;Total time=0.048s <- This should be asynchronous
;Total time=0.048s
Copy in I2 time=0.066s ;Total time=0.114s

time needed for transfer and calculation asynchronously is lower then with just simple offload ( | transfer in | calculate | transfer out | )

but copy in case is different
;Total time=0.000s
Copy in I1 time=0.013s ;Total time=0.013s <- first copy in not necessary asynchronous
;Total time=0.013s
Copy in I2 time=0.066s ;Total time=0.079s <- second copy in should be asynchronous
;Total time=0.079s
Computation S1 time=0.058s ;Total time=0.137s
;Total time=0.137s
Copy out S1 time=0.060s ;Total time=0.197s
;Total time=0.197s
Copy in I1 time=0.031s ;Total time=0.228s <- same here

In this case the double buffering takes longer than simple offload.
for copy in I'm using pragma like this

#pragma offload_transfer target(mic:0) \
in( input[0:copy_size]:into(I1[0:copy_size]) RETAIN REUSE ) signal(I1)

The question here is am I using it correctly? If I do, what is the problem?

I'm happy to stuck with my copy out version but I would like to understand what is going on.

Kevin_D_Intel · ‎04-02-2014

We had a past defect that has been fixed were local data could cause asynchronous input to become synchronous due to the required use of a buffer for transferring such data but it was said this did not affect pointer and static data. Are you still using the 13.1 compiler?

If you are using a variant of the original code you posted, in your current code that you are timing, what happens with the timings if you declare your pointers with static or as global?

It would be helpful also if we could get a copy of your program that we can investigate further.

Ignac_A_ · ‎04-03-2014

Hi,

I'm using similar program. And now I'm using compiler version 14.0.2 and MPSS Version 2.1.6720-21. Changing the declarations to static didn't change anything.

I think it is OK if I send the program to you. Should I send it via PM?

Kevin_D_Intel · ‎04-03-2014

Ok, thank you for trying static. Yes, for privacy/protection of your source, send me that via a private message.

Kevin_D_Intel · ‎04-05-2014

I reproduced the behavior you described using your test case and requested Development's help with investigating further. I will keep you posted on what I hear.

(Internal tracking id: DPD200255229)