I'm trying to hide DRAM access latency by issuing accesses concurrently.
If I understand the CPU architecture correctly, a core should be able to issue multiple loads simultaneously as long as there is no data dependency between them. So my idea was to prepare an array of the indexes I want to look up, issue all the reads together, and wait only for the time of one lookup plus the time to execute all the lookup instructions. But I don't see much speedup from this approach. I expected that AVX gathers, at least, would help hide that latency. Next I tried prefetching the data, but again with no luck.
I'm reading 128-bit values aligned to 8 bytes. DRAM bandwidth is only about 40 GB/s (4-channel overclocked memory, prefetching disabled, huge pages).
Am I hitting some limit on in-flight loads? Are upcoming architectures (Knights Mill?) supposed to speed up these kind of
Most Intel processors have a limit of 10 concurrent L1 Data Cache Misses per physical core, though I think that might have been increased to 12 for the Xeon Phi x200 (Knights Landing) processor. In many cases this is not enough concurrency to fully tolerate the DRAM latency. For contiguous (or nearly contiguous) accesses, the L2 Hardware Prefetchers can provide additional concurrency to allow a processor to approach full memory bandwidth.
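To make the concurrency limit concrete, here is a minimal sketch of a batched lookup that tries to keep several independent cache misses in flight per core. It is only an illustration under stated assumptions: `value128`, `BATCH`, and `gather_xor` are hypothetical names, the batch size of 8 is simply a guess that stays below the 10-miss limit mentioned above, and `__builtin_prefetch` is a GCC/Clang builtin rather than standard C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 128-bit record, standing in for the 16-byte values in the question. */
typedef struct { uint64_t lo, hi; } value128;

/* Assumed batch size: kept below the ~10 outstanding L1D misses per core. */
enum { BATCH = 8 };

/* XOR-reduce table[idx[i]] for i in [0, n), issuing prefetches one batch at a time. */
uint64_t gather_xor(const value128 *table, const uint32_t *idx, size_t n)
{
    assert(n == 0 || (table != NULL && idx != NULL));
    uint64_t acc = 0;
    size_t i = 0;

    for (; i + BATCH <= n; i += BATCH) {
        /* Phase 1: request every line of the batch. The requests are
           independent, so the misses can overlap. A value that straddles
           two cache lines only has its first line requested here. */
        for (size_t j = 0; j < BATCH; j++)
            __builtin_prefetch(&table[idx[i + j]], 0, 0);

        /* Phase 2: consume the values, ideally after the lines have arrived. */
        for (size_t j = 0; j < BATCH; j++) {
            const value128 *v = &table[idx[i + j]];
            acc ^= v->lo ^ v->hi;
        }
    }

    /* Tail: fewer than BATCH elements left, use plain loads. */
    for (; i < n; i++)
        acc ^= table[idx[i]].lo ^ table[idx[i]].hi;

    return acc;
}
```

Raising `BATCH` past the number of misses the core can track is unlikely to help on a single core, which is why spreading the lookups over more cores is usually the next step.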
I would need more specifics to analyze your results, but an example from a similar system may be useful. On my Xeon E5-2690 v3 (Haswell EP, 12 core, 2.6 GHz nominal) systems, I see unloaded local DRAM latency of about 85 ns. The peak DRAM bandwidth is 68.266 GB/s (4 channels of DDR4/2133) per socket. This gives a latency-bandwidth product (85 ns × 68.266 GB/s) of 5802 bytes, or 91 cache lines of 64 bytes. So if your code is only able to generate demand L1 Data Cache Misses, at 10 misses per core it would take at least 9 cores to generate enough cache misses to get close to full bandwidth.
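The arithmetic above can be reproduced with a few helper functions. This is just a sketch: the function names are made up for illustration, and the 64-byte line size and 10 misses per core are the assumptions already used in the text.

```c
#include <assert.h>

/* Bytes that must be in flight to sustain a bandwidth at a given latency.
   ns * GB/s yields bytes directly, because 1e-9 s * 1e9 B/s = 1 B. */
double bytes_in_flight(double latency_ns, double bandwidth_gbs)
{
    assert(latency_ns > 0.0 && bandwidth_gbs > 0.0);
    return latency_ns * bandwidth_gbs;
}

/* The same quantity expressed in cache lines. */
double lines_in_flight(double latency_ns, double bandwidth_gbs, double line_bytes)
{
    assert(line_bytes > 0.0);
    return bytes_in_flight(latency_ns, bandwidth_gbs) / line_bytes;
}

/* Cores' worth of concurrency needed if each core sustains
   `misses_per_core` outstanding cache-line misses. */
double cores_worth(double lines, double misses_per_core)
{
    assert(misses_per_core > 0.0);
    return lines / misses_per_core;
}
```

With 85 ns and 68.266 GB/s, `bytes_in_flight` gives roughly 5802.6 bytes, `lines_in_flight` with 64-byte lines gives about 90.7 lines, and `cores_worth` with 10 misses per core gives about 9.07. Plugging in a higher, loaded latency instead of 85 ns raises all three numbers proportionally.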
In practice, one often finds that some interaction of the code and the microarchitecture prevents you from getting all 10 L1 Data Cache misses from each core. For recent processors, the performance counter event L1D_PEND_MISS.PENDING (Event 0x48, Umask 0x01) is intended to be used to measure the average number of L1 cache misses pending. I have not tested this event, but it should be fairly easy to set up a set of microbenchmarks to see if the results are reasonable.
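As a rough sketch of how such a measurement could be reduced to a single number: assuming the event adds the current number of outstanding L1D misses to its count every cycle, dividing the counter delta by the elapsed core cycles over the same interval gives the average number of misses pending. The helper names below are hypothetical, and the raw encoding (umask in bits 8-15, event select in bits 0-7) is the layout commonly used when programming Intel core PMU events as raw events; both should be checked against the tool you use.

```c
#include <assert.h>
#include <stdint.h>

/* Raw event code: umask in bits 8-15, event select in bits 0-7.
   For Event 0x48, Umask 0x01 this yields 0x0148. */
uint64_t raw_event_config(uint8_t event, uint8_t umask)
{
    return ((uint64_t)umask << 8) | event;
}

/* Average number of L1D misses outstanding per cycle.
   pending_sum: delta of L1D_PEND_MISS.PENDING over the interval
   cycles:      delta of unhalted core cycles over the same interval */
double avg_pending_misses(uint64_t pending_sum, uint64_t cycles)
{
    assert(cycles > 0);
    return (double)pending_sum / (double)cycles;
}
```

For example, a made-up reading of 700 pending counts over 100 cycles would correspond to an average of 7 misses in flight, short of the limit of 10.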
If you are using many cores, queuing delays in the memory controller will increase the average latency, so the latency-bandwidth product (as defined above) will understate the amount of concurrency required. The Intel Memory Latency Checker can be used to measure this effect ("loaded latencies") for various memory read/write combinations.