[Advice] Efficient way to gather data from global memory

Edgardo_Doerner · ‎04-19-2018

Hi to everyone,

I am working on a OpenCL Monte Carlo code for particle transport, and in order to alleviate the thread divergence, due to the stochastic nature of the method, I am trying to implement an event-by-event simulation, i.e. process concurrently particles that will carry out a specific process (e.g. photo-electric effect, compton scattering, etc.).

Several buffers containing particle information (energy, position, etc) are stored in global memory. Additionally, a buffer containing an index to the next event (i.e. physical process) is associated to each particle. First the most frequent next event is determined using a histogram-like kernel, as the following:

void __kernel freq_kernel(
	// Number of data per work-item to be analyzed
	int nperwork_item,

	// Particle stack
	__global int *gstack_stat,
	
	// Frequency of events
	__global int *gfreq_events) {

	int gid = get_global_id(0);
	int lid = get_local_id(0);

	int lsize = get_local_size(0);
	int gsize = get_global_size(0);

	// Local array containing frequency of events
	__local int lfreq_events[MX_EVENTS];

	// Initialize local frequency array.
	for(int i=lid; i<MX_EVENTS; i+=lsize) { 
		lfreq_events = 0;
	}
	barrier(CLK_LOCAL_MEM_FENCE);

	// Compute the frequency of events local array.
	int ndata = gsize*nperwork_item;
	for(int i=gid; i<ndata; i+=gsize) { 
		atomic_add(&lfreq_events[gstack_stat-1], 1);
	}
	barrier(CLK_LOCAL_MEM_FENCE);

	// Finally, save local results to global memory.
	for(int i=lid; i<MX_EVENTS; i+=lsize) { 
		atomic_add(&gfreq_events, lfreq_events);
	}
	barrier(CLK_LOCAL_MEM_FENCE);

}

The result is an array with the frequency of each possible event. Then it is transferred to the host and the event with the highest frequency is selected as the next event to be processed. In the host we have the following code:

int next_event = 0;
int max_count = 0;

for (cl_uint idevice = 0; idevice < number_of_devices; idevice++) {
        max_count = *std::max_element(h_freq_events[idevice].begin(), h_freq_events[idevice].end());
	next_event = (int)std::distance(h_freq_events[idevice].begin(), std::max_element(h_freq_events[idevice].begin(), h_freq_events[idevice].end()));
}

Now the idea should be to determine which particles must be simulated in the next step. Essentially, I would like to obtain the indexes of such particles in order to gather its particle information (energy, position, etc...) from global memory. A serial alternative could be the following:

for (cl_uint idevice = 0; idevice < number_of_devices; idevice++) {
	for (int i = 0; i<global_work_size[idevice] * nperwork_item[idevice]; i++) {
		if (h_pstack_stat[idevice] == next_event) {
			h_mask[idevice].push_back(i);
		}
	}
}

So I need an advice, do you now an efficient way to obtain the needed indexes?, it is clear that calculating them on the host is very expensive and I am looking for a more efficient way. I have looked for information online without success. Maybe I should first order the indexes array, together with the others holding the particle information, and then carry out the gather operation?. Thanks for your help!.