When copying from the CPU to the GPU on Intel graphics hardware (Intel HD 4000 and Iris Pro 5200) I'm running into a significant bandwidth limitation.
2000 MB/s is the highest I'm seeing, and it is even less than that on the Iris Pro (1800 MB/s).
This was benchmarked using a simple utility that copies from a system-memory buffer to an offscreen surface via LockRect.
Transfer took 4243 ms ( 1864 MB/sec )
OpenCL benchmarks, by comparison, reach at least 8000 MB/s:
Host to Device Bandwidth, 1 Device(s), Paged memory, mapped access
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 8473.5
What could be causing this bottleneck? It should be much faster, since the CPU and GPU share the same memory.
Test code below:
#include &lt;windows.h&gt;
#include &lt;d3d9.h&gt;
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;

// Body of main() from the benchmark utility:

IDirect3D9Ex *d3d;
IDirect3DDevice9Ex *dev;
IDirect3DSurface9 *surface;
void *buffer;

int width = 1920;
int height = 1080;
int copies = 1000;
int bufferLen = width * height * 4;

HRESULT hr = Direct3DCreate9Ex(D3D_SDK_VERSION, &d3d);
if (FAILED(hr)) {
    printf("Failed to create Direct3D9Ex: 0x%08X\n", (unsigned)hr);
    return 1;
}

D3DPRESENT_PARAMETERS p;
p.AutoDepthStencilFormat = D3DFMT_UNKNOWN;
p.BackBufferCount = 2;
p.BackBufferFormat = D3DFMT_A8R8G8B8;
p.BackBufferWidth = width;
p.BackBufferHeight = height;
p.EnableAutoDepthStencil = FALSE;
p.Flags = 0;
p.hDeviceWindow = 0;
p.MultiSampleQuality = 0;
p.MultiSampleType = D3DMULTISAMPLE_NONE;
p.PresentationInterval = D3DPRESENT_INTERVAL_DEFAULT;
p.SwapEffect = D3DSWAPEFFECT_DISCARD;
p.Windowed = TRUE;
p.FullScreen_RefreshRateInHz = 0;

hr = d3d->CreateDeviceEx(D3DADAPTER_DEFAULT, D3DDEVTYPE_HAL, NULL,
                         D3DCREATE_MULTITHREADED | D3DCREATE_HARDWARE_VERTEXPROCESSING,
                         &p, NULL, &dev);
if (FAILED(hr)) {
    printf("Failed to create device: 0x%08X\n", (unsigned)hr);
    return 1;
}

hr = dev->CreateOffscreenPlainSurfaceEx(width, height, D3DFMT_A8R8G8B8,
                                        D3DPOOL_DEFAULT, &surface, NULL, 0);
if (FAILED(hr)) {
    printf("Failed to create surface: 0x%08X\n", (unsigned)hr);
    return 1;
}

buffer = malloc(bufferLen);

LARGE_INTEGER freq, t1, t2;
LONGLONG milliseconds;
LONGLONG bandwidth = 0;
QueryPerformanceFrequency(&freq);
QueryPerformanceCounter(&t1);

D3DLOCKED_RECT r;
memset(buffer, 255, bufferLen);
printf("Copying %d %dx%d 32-bit surfaces...\n", copies, width, height);

for (int i = 0; i < copies; i++) {
    surface->LockRect(&r, NULL, 0);
    memcpy(r.pBits, buffer, bufferLen);
    bandwidth += bufferLen;
    surface->UnlockRect();
}

QueryPerformanceCounter(&t2);
milliseconds = (t2.QuadPart - t1.QuadPart) * 1000 / freq.QuadPart;
bandwidth = ((bandwidth * 1000) / milliseconds) / 1048576;
printf("Transfer took %I64d ms ( %I64d MB/sec )\n", milliseconds, bandwidth);
Hi Martin,
I am unaware of any bandwidth limitation issues in D3D 9 on HD Graphics, so I need to do some investigating. I hope to have some information for you soon.
-Michael
Hi Martin,
I talked with one of our lead Direct3D driver developers about the issue you are seeing. Here are some observations from our talk.
It is possible that the offscreen plain surface you are creating is not CPU-cached. This is likely if it is in the default pool; it will not be "CPU optimized" for direct access.
There can also be significant overhead associated with each LockRect API call. The code as written doesn't need to lock/unlock on every copy iteration; if you want to measure only the CPU copy overhead, move the LockRect outside the for loop.
He also observed a bug in the logic: you don't appear to be using the pitch returned by LockRect. If the surface is tiled for some reason, there will be a stride associated with it as well.
Why do you want/need to directly manipulate default-pool graphics memory this way, though? I would think you could get better performance using the Update* functions.
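To illustrate the pitch point with a standalone sketch (plain memory buffers stand in for the locked surface here; the function name is hypothetical): a single memcpy of width*height*4 bytes is only valid when the pitch equals width*4, otherwise each row must be copied separately.

```cpp
#include <cstdint>
#include <cstring>

// Copy a tightly packed 32-bit image into a destination whose rows are
// separated by 'dstPitch' bytes, as LockRect reports in D3DLOCKED_RECT::Pitch.
void copy_with_pitch(uint8_t *dst, int dstPitch,
                     const uint8_t *src, int width, int height)
{
    const int rowBytes = width * 4;                  // packed source row size
    for (int y = 0; y < height; ++y) {
        std::memcpy(dst + (size_t)y * dstPitch,      // advance by surface pitch
                    src + (size_t)y * rowBytes,      // advance by packed row
                    rowBytes);
    }
}
```

In the benchmark loop this would replace the single memcpy with something like copy_with_pitch((uint8_t *)r.pBits, r.Pitch, (const uint8_t *)buffer, width, height);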
I hope this helps you.
-Michael
Hi Michael,
Thanks for your response.
The program I am developing involves streaming a large number of raw video feeds for live GPU compositing and processing.
On NVIDIA and AMD cards I see decent memory bandwidth performance (3-4GB/sec) even though a PCI-E bus transfer is involved.
On the Iris Pro I would expect to see equal or even higher performance due to the use of shared memory.
I have tried a number of different ways of streaming the video to the GPU, including UpdateSurface and UpdateTexture from SYSTEMMEM, and the performance is actually slightly worse (see below).
1. Moving the LockRect outside the loop results in the same performance:
Transfer took 4225 ms ( 1872 MB/sec )
2. Inside the loop for comparison:
Transfer took 4245 ms ( 1863 MB/sec )
3. Using a single 4096-byte system buffer as the source to ensure CPU caching:
Transfer took 3057 ms ( 2587 MB/sec )
4. If I use UpdateSurface from a SYSTEMMEM surface to a DEFAULT surface the performance is even less:
Transfer took 4474 ms ( 1768 MB/sec )
5. Copying to SYSTEMMEM only, no GPU copy
Transfer took 813 ms ( 9729 MB/sec )
It sounds like there is a large overhead somewhere in the Intel D3D9 driver, as copying using D3D11 appears to be much faster
(even with the rather convoluted setup of using D3D11 to copy to a shared surface that is then used by D3D9).
Any insight you could provide would be appreciated; I can file a dedicated support request if that is more appropriate.
Regards,
Martin
Hi Martin,
Sorry for not replying sooner; things are quite busy right now. I am talking to some driver developers to see if we can figure out what is going on. If it looks like a driver issue of some kind, I will file an internal ticket. I will let you know what we find out.
-Michael
Hi Michael,
Any updates on this?
Thanks,
Martin
Hi Martin,
We are still looking into the issue. We are seeing some odd behavior: during a copy to a surface allocated with D3DPOOL_SYSTEMMEM there are no reads or writes to DDR3, which could be explained by LLC caching or by remapped memory pages. Also, during a copy to a surface allocated with D3DPOOL_DEFAULT, the Iris 5200 reaches maximum DDR3 throughput, which may be explained by tiling. We are running some experiments and continuing to root-cause the issue.
-Michael
Did you try to run GPUView and Windows Performance Recorder?
Based on Michael Coppock's post, I would run WPR and observe CPU load during execution of the LockRect call. Maybe there are synchronization issues contributing to the slow memory bandwidth. In any case, you should perform system-wide testing with the aforementioned tools.
Hi Michael,
Any progress yet in finding a solution?
iliyapolak,
The bottleneck occurs inside the memcpy between the two pointers, so none of the available profiling tools record what is happening in that time, other than higher-than-normal CPU usage (compared to copying between system memory buffers).
Regards,
Martin
@Martin S thanks for the response.
Did you try running VTune during the memcpy execution? As I understand from your post, the high CPU usage maps directly to the memcpy machine code.
Hi Martin,
Sorry for the delay in getting this figured out. I asked our performance developer team to take a look at this and after some investigation they provided a solution:
CPU access to D3DPOOL_DEFAULT resources can be quite slow and should generally be avoided in performance paths. It is often faster to use a D3DPOOL_SYSTEMMEM resource as an intermediate buffer for such CPU access. For example, to upload a texture: create a D3DPOOL_SYSTEMMEM resource, Lock it, load the texture directly from disk into that resource, Unlock it, and then do a GPU BLT (e.g. UpdateSurface) to the desired D3DPOOL_DEFAULT resource.
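Sketched in code against the variable names from your test program (error handling omitted; this is a sketch of the pattern, not measured code):

```cpp
#include <windows.h>
#include <d3d9.h>
#include <string.h>

// Upload pattern: fill a CPU-friendly SYSTEMMEM staging surface, then let the
// GPU BLT it into the DEFAULT-pool surface. 'dev', 'surface', 'buffer',
// 'width' and 'height' are the variables from the test program above.
void upload_via_staging(IDirect3DDevice9Ex *dev, IDirect3DSurface9 *surface,
                        const void *buffer, int width, int height)
{
    IDirect3DSurface9 *staging = NULL;
    dev->CreateOffscreenPlainSurface(width, height, D3DFMT_A8R8G8B8,
                                     D3DPOOL_SYSTEMMEM, &staging, NULL);

    // CPU write into cacheable system memory, honoring the returned pitch.
    D3DLOCKED_RECT lr;
    staging->LockRect(&lr, NULL, 0);
    for (int y = 0; y < height; ++y)
        memcpy((BYTE *)lr.pBits + (size_t)y * lr.Pitch,
               (const BYTE *)buffer + (size_t)y * width * 4, width * 4);
    staging->UnlockRect();

    // GPU copy from system memory into the default-pool surface.
    dev->UpdateSurface(staging, NULL, surface, NULL);
    staging->Release();
}
```

In a streaming loop the staging surface would be created once and reused, rather than recreated per frame.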
-Michael
Hi Michael,
Thanks for the response.
If you look at one of my previous posts you will see I already tried that solution.
"4. If I use UpdateSurface from a SYSTEMMEM surface to a DEFAULT surface the performance is even less:
Transfer took 4474 ms ( 1768 MB/sec )"
So the memory transfer performance is actually lower with UpdateSurface; something else must be going on.
Regards,
Martin
>>> The bottleneck occurs inside the memcpy between the two pointers, so none of the profiling tools available record what is happening >>>
VTune should be able to read CPU counters during the memcpy slowdown.