clEnqueueMap/UnmapBuffer overheads

PCox · ‎06-17-2015

Can someone please straighten me out on expected clEnqueueMapBuffer overheads under Haswell?

Environment: Windows 7 sp1, VS2013, i7-4770, driver 10.18.14.4170

I have my own 768kb buffer which needs to be accessed by the HD 4600 GPU.
I clCreateBuffer with CL_MEM_ALLOC_HOST_PTR, which I believe sets aside some pinned memory for later use. Later on I use clEnqueueMapBuffer (with CL_MAP_READ), and use the resultant pointer to populate the new cl_mem with my data. An event from the clEnqueueMapBuffer call is used to kick off a clEnqueueUnmapMemObject straight afterwards, and similarly, an event from the clEnqueueUnmapMemObject is used in the event_wait_list of the kernel launch straight after that.
Code below, sans error handling.

All fairly straightforward, but timing is a problem. My VTune trace shows the clEnqueueMapBuffer taking 770us in the queue before a 0.3us compute. Then the clEnqueueUnmapMemObject takes 130us followed by a similarly negligible 0.3us compute time. However since my kernel takes only 400us of compute, clEnqueueMap/UnmapBuffer queuing is taking a disproportionate part of the overall time. Am I just in the noise and overheads with such small function times, or can I improve this at all?

		tile_size_bytes = BITMAP_NON_TEXTURED_SIZE_PER_TILE_OPENCL * sizeof(unsigned char);

		input_buffer_cl_mem = 
		clCreateBuffer(oclInstance->context,
			CL_MEM_ALLOC_HOST_PTR,
			tile_size_bytes,
			NULL,
			&errcode_ret);
			
			
		// later....
		
        // map the buffer

        mapped_tile_buffer = clEnqueueMapBuffer(
            oclInstance->queue,
            input_buffer_cl_mem,
            CL_FALSE,
            CL_MAP_READ,
            0,
            tile_size_bytes,
            0,
            NULL,
            &writeTileEvent,
            &errcode_ret);

        // copy strided data into mapped buffer

        thisTileStart = tileStrip + (tileCount * BITMAP_TILE_WIDTH_PIXELS * CHANNEL_COUNT_OPENCL);
        destride_incoming_bitmap_tile_in_strip_into_buffer(thisTileStart, bitmap_step, (char *)mapped_tile_buffer);
")

        // and unmap

        errcode_ret = clEnqueueUnmapMemObject(
            oclInstance->queue,
            input_buffer_cl_mem,
            mapped_tile_buffer,
            1,
            &writeTileEvent,
            &unmapEvent);
		
		
		// (kernel params already set up...
		
		size_t globalSize[3];
		size_t localSize[] = { THREAD_BLOCK_TILE_WIDTH_IN_PIXELS, THREAD_BLOCK_TILE_HEIGHT_IN_PIXELS, 1 };  // blocks are default 64 x 8
		size_t globalSizeWorkgroups[] = { ARR_IMAGE_TILE_WIDTH / THREAD_BLOCK_TILE_WIDTH_IN_PIXELS,			// 256/64 = 4
			ARR_IMAGE_TILE_HEIGHT / THREAD_BLOCK_TILE_HEIGHT_IN_PIXELS,										// 256/8 = 32
			CHANNEL_COUNT_OPENCL };																			// 3

		globalSize[0] = globalSizeWorkgroups[0] * localSize[0];
		globalSize[1] = globalSizeWorkgroups[1] * localSize[1];
		globalSize[2] = globalSizeWorkgroups[2] * localSize[2];

		errcode_ret = clEnqueueNDRangeKernel(
		oclInstance->queue,								                            // command queue
		acej_kernel_tile_ifdct_cpuhuffman,											// kernel
		3,														                    // work_dim
		0,														                    // global_work_offset
		globalSize,												                    // global_work_size
		localSize,					                                                // local_work_size :  localSize or NULL
		1,														                    // num_events_in_wait_list
		&unmapEvent,													            // event_wait_list
		(eventList + tileCount));											        // event

Robert_I_Intel · ‎06-18-2015

Hi Philip,

I am traveling today, so won't be able to try what you are doing until tomorrow.

Meanwhile, you can try the following things to see if the behavior changes in any way:

1. Try to get rid of events.

2. Instead of CL_MEM_ALLOC_HOST_PTR, even though it is a valid use, try allocating memory with _aligned_malloc with 4096 byte alignment and then create a buffer with CL_MEM_USE_HOST_PTR. Make sure that the length of your buffer is a multiple of 64 bytes.

3. There is a newer 15.36.21.64.4222 driver out there (see https://downloadcenter.intel.com/search?keyword=4th+Generation ) - try it to see if it changes behavior in any way.

Let me know how it goes.

Thanks!

Robert_I_Intel · ‎06-19-2015

Hi Philip,

I am back in office. Couple of questions:

1. Could you please provide a complete reproducer that I can run on my system? You have several constants that are undefined in the snippet you provided as well as the kernel itself.

2. You use CL_MAP_READ in the example above where I would expect CL_MAP_WRITE - aren't you writing that mapped buffer?

3. Were you able to try things I suggested?

Thanks!

Robert