I wrote test code, modeled on Agner's test program, to measure the sustained read bandwidth of an Intel Skylake 6700K when using AVX instructions. The test measures the number of clock cycles spent on memory transfers using rdtsc and rdtscp, following the Intel white paper (https://www.intel.de/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-...).
The benchmark uses inline assembly that iterates over an array (one array per thread). Each array is allocated with _mm_malloc so that the data is aligned to 32 bytes. (The source files are attached.) The test code is repeated until a predefined time duration has elapsed.
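For readers without the attachment, here is a minimal sketch of this kind of measurement. It is not my attached code: it uses AVX intrinsics instead of inline assembly, uses lfence around the timestamps in place of the CPUID serialization from the white paper, and runs a fixed number of repetitions instead of a time limit. The names timed_avx_read and best_bytes_per_tick are made up for illustration, and it assumes GCC or Clang on x86-64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc, __rdtscp, _mm_malloc, AVX intrinsics */

/* Hypothetical helper: reads `bytes` bytes from a 32-byte-aligned buffer
   with 256-bit AVX loads and returns the elapsed TSC ticks. */
__attribute__((target("avx")))
static uint64_t timed_avx_read(const float *buf, size_t bytes)
{
    assert(bytes % 32 == 0);          /* whole 256-bit vectors only */
    size_t n = bytes / sizeof(float);
    unsigned int aux;

    _mm_lfence();                     /* assumption: lfence instead of CPUID */
    uint64_t start = __rdtsc();
    for (size_t i = 0; i < n; i += 8) {
        __m256 v = _mm256_load_ps(buf + i);   /* aligned 32-byte load */
        __asm__ volatile("" : : "x"(v));      /* keep the load from being optimized away */
    }
    uint64_t stop = __rdtscp(&aux);
    _mm_lfence();
    return stop - start;
}

/* Hypothetical driver: allocates the buffer, repeats the timed read `reps`
   times and returns bytes per TSC tick for the fastest repetition.
   Note: the TSC counts at a fixed reference rate, not at the core clock. */
static double best_bytes_per_tick(size_t bytes, int reps)
{
    float *buf = (float *)_mm_malloc(bytes, 32);
    if (buf == NULL)
        return -1.0;

    /* Touch every element so the pages are mapped before timing. */
    for (size_t i = 0; i < bytes / sizeof(float); i++)
        buf[i] = 1.0f;

    uint64_t best = UINT64_MAX;
    for (int r = 0; r < reps; r++) {
        uint64_t t = timed_avx_read(buf, bytes);
        if (t < best)
            best = t;
    }
    _mm_free(buf);
    return best == 0 ? -1.0 : (double)bytes / (double)best;
}
```

Calling best_bytes_per_tick with a range of sizes (for example from a few KB up to several hundred MB) gives one point of the bandwidth curve per size. In the multi-threaded case each thread would run this on its own buffer.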
For the single-thread test, the bandwidth curve (orange curve) behaves as expected. With two threads (gray curve), however, there is a strange effect in the transition between the L3 cache and DRAM. Since my single-thread results are fine, I assume it is somehow related to the interaction between the threads, but I cannot identify the problem (I do not have much experience with multi-threaded programming).
Can someone identify the problem? Any help is welcome!