11-19-2014 01:30 PM
Reading from a mapped buffer is significantly (~40x) slower than reading from system memory. This does not include the mapping call itself (which might block), only the actual memory read instructions. I use MOVDQA SSE instructions to read aligned 16-byte chunks, and RDTSC to measure the speed. Running the same code with a different source pointer (one that points to system memory), it runs much faster.

This would make sense if I were using a discrete GPU with its own memory, in which case I'd have to go through the bus to read it. But I'm using an Intel HD 5000, which I thought simply uses a reserved chunk of system RAM. Is this normal behavior? Any suggestions for reading the data back faster? We already use double-buffered PBOs (in fact, we have a ring of PBOs) with asynchronous ReadPixels, and read the mapped data in a separate thread. But as I said, driver blocking is not the issue here.

I'm using an Intel NUC with an HD 5000 GPU, Win8.1 x64, driver 10.18.10.3960. Any help or advice would be appreciated!

Andras
11-28-2014 02:54 PM
Ok, I have found the solution. It turns out that the GPU memory is mapped as USWC (Uncacheable Speculative Write Combining) memory, and to read such memory quickly you have to use the MOVNTDQA instruction! It's incredible, but simply replacing MOVDQA with MOVNTDQA can increase load performance tenfold or more! Here's a good reference on the subject: https://software.intel.com/en-us/articles/increasing-memory-throughput-with-intel-streaming-simd-ext...