- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Can someone please straighten me out on expected clEnqueueMapBuffer overheads under Haswell?
Environment: Windows 7 sp1, VS2013, i7-4770, driver 10.18.14.4170
I have my own 768kb buffer which needs to be accessed by the HD 4600 GPU.
I clCreateBuffer with CL_MEM_ALLOC_HOST_PTR, which I believe sets aside some pinned memory for later use. Later on I use clEnqueueMapBuffer (with CL_MAP_READ), and use the resultant pointer to populate the new cl_mem with my data. An event from the clEnqueueMapBuffer call is used to kick off a clEnqueueUnmapMemObject straight afterwards, and similarly, an event from the clEnqueueUnmapMemObject is used in the event_wait_list of the kernel launch straight after that.
Code below, sans error handling.
All fairly straightforward, but timing is a problem. My VTune trace shows the clEnqueueMapBuffer taking 770us in the queue before a 0.3us compute. Then the clEnqueueUnmapMemObject takes 130us followed by a similarly negligible 0.3us compute time. However since my kernel takes only 400us of compute, clEnqueueMap/UnmapBuffer queuing is taking a disproportionate part of the overall time. Am I just in the noise and overheads with such small function times, or can I improve this at all?
tile_size_bytes = BITMAP_NON_TEXTURED_SIZE_PER_TILE_OPENCL * sizeof(unsigned char); input_buffer_cl_mem = clCreateBuffer(oclInstance->context, CL_MEM_ALLOC_HOST_PTR, tile_size_bytes, NULL, &errcode_ret); // later.... // map the buffer mapped_tile_buffer = clEnqueueMapBuffer( oclInstance->queue, input_buffer_cl_mem, CL_FALSE, CL_MAP_READ, 0, tile_size_bytes, 0, NULL, &writeTileEvent, &errcode_ret); // copy strided data into mapped buffer thisTileStart = tileStrip + (tileCount * BITMAP_TILE_WIDTH_PIXELS * CHANNEL_COUNT_OPENCL); destride_incoming_bitmap_tile_in_strip_into_buffer(thisTileStart, bitmap_step, (char *)mapped_tile_buffer); ") // and unmap errcode_ret = clEnqueueUnmapMemObject( oclInstance->queue, input_buffer_cl_mem, mapped_tile_buffer, 1, &writeTileEvent, &unmapEvent); // (kernel params already set up... size_t globalSize[3]; size_t localSize[] = { THREAD_BLOCK_TILE_WIDTH_IN_PIXELS, THREAD_BLOCK_TILE_HEIGHT_IN_PIXELS, 1 }; // blocks are default 64 x 8 size_t globalSizeWorkgroups[] = { ARR_IMAGE_TILE_WIDTH / THREAD_BLOCK_TILE_WIDTH_IN_PIXELS, // 256/64 = 4 ARR_IMAGE_TILE_HEIGHT / THREAD_BLOCK_TILE_HEIGHT_IN_PIXELS, // 256/8 = 32 CHANNEL_COUNT_OPENCL }; // 3 globalSize[0] = globalSizeWorkgroups[0] * localSize[0]; globalSize[1] = globalSizeWorkgroups[1] * localSize[1]; globalSize[2] = globalSizeWorkgroups[2] * localSize[2]; errcode_ret = clEnqueueNDRangeKernel( oclInstance->queue, // command queue acej_kernel_tile_ifdct_cpuhuffman, // kernel 3, // work_dim 0, // global_work_offset globalSize, // global_work_size localSize, // local_work_size : localSize or NULL 1, // num_events_in_wait_list &unmapEvent, // event_wait_list (eventList + tileCount)); // event
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi Philip,
I am traveling today, so won't be able to try what you are doing until tomorrow.
Meanwhile, you can try the following things to see if the behavior changes in any way:
1. Try to get rid of events.
2. Instead of CL_MEM_ALLOC_HOST_PTR, even though it is a valid use, try allocating memory with _aligned_malloc with 4096 byte alignment and then create a buffer with CL_MEM_USE_HOST_PTR. Make sure that the length of your buffer is a multiple of 64 bytes.
3. There is a newer 15.36.21.64.4222 driver out there (see https://downloadcenter.intel.com/search?keyword=4th+Generation ) - try it to see if it changes behavior in any way.
Let me know how it goes.
Thanks!
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi Philip,
I am back in office. Couple of questions:
1. Could you please provide a complete reproducer that I can run on my system? You have several constants that are undefined in the snippet you provided as well as the kernel itself.
2. You use CL_MAP_READ in the example above where I would expect CL_MAP_WRITE - aren't you writing that mapped buffer?
3. Were you able to try things I suggested?
Thanks!
Robert

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page