Reading from a mapped pixel buffer object (PBO) is significantly (~40x) slower than reading from system memory.
This figure does not include the mapping call itself (which might block), only the actual memory read instructions. I use movdqa SSE instructions to read aligned 16-byte chunks and RDTSC to measure the elapsed time. When I run the same code with a source pointer that points to ordinary system memory instead, it runs much faster.
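A minimal sketch of this kind of measurement, assuming SSE2 intrinsics and the compiler's `__rdtsc` intrinsic. The function name `time_aligned_reads` and the XOR checksum are illustrative choices, not the actual code used. The checksum only exists so the compiler cannot discard the loads.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>   /* SSE2: _mm_load_si128 is an aligned load (movdqa) */
#ifdef _MSC_VER
#include <intrin.h>      /* __rdtsc on MSVC */
#else
#include <x86intrin.h>   /* __rdtsc on GCC/Clang */
#endif

/* Hypothetical helper: reads `bytes` bytes from `src` in aligned 16-byte
   chunks and returns the elapsed TSC ticks.
   `src` must be 16-byte aligned and `bytes` must be a multiple of 16.
   The low 32 bits of the XOR of all chunks are written to *checksum. */
static uint64_t time_aligned_reads(const void *src, size_t bytes,
                                   uint32_t *checksum)
{
    assert(((uintptr_t)src & 15u) == 0);
    assert((bytes & 15u) == 0);

    const __m128i *p = (const __m128i *)src;
    size_t n = bytes / 16;
    __m128i acc = _mm_setzero_si128();

    /* RDTSC is not a serializing instruction, so very short runs can be
       skewed by out-of-order execution; use a large buffer. */
    uint64_t start = __rdtsc();
    for (size_t i = 0; i < n; ++i)
        acc = _mm_xor_si128(acc, _mm_load_si128(p + i));
    uint64_t stop = __rdtsc();

    *checksum = (uint32_t)_mm_cvtsi128_si32(acc);
    return stop - start;
}
```

Calling it once with the mapped pointer and once with a pointer to an aligned system-memory buffer of the same size gives the two tick counts being compared.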
This would make sense if I were using a discrete GPU with its own memory, in which case the reads would have to go over the bus. But I'm using an Intel HD 5000, which I thought simply uses a reserved chunk of system RAM.
Is this normal behavior? Any suggestions for reading back data faster? We already use double-buffered PBOs (in fact, a ring of PBOs) with asynchronous glReadPixels, and we read the mapped data in a separate thread. But as I said, driver blocking is not the issue here.
I'm using an Intel NUC with an HD 5000 GPU, Windows 8.1 x64, driver version 10.18.10.3960.
Any help or advice would be appreciated!