After reading this article about the I/OAT DMA engine, I have been trying to build an I/OAT-powered scan operation. The basic principle is to divide the input data into equal-sized chunks, which are then copied sequentially into one of two local chunk-sized buffers in an alternating fashion. We use two buffers because, in the end, we would like to process one buffer while the other is being loaded by the DMA engine (and then swap them), essentially overlapping data transfer and computation.
However, we have noticed that the DMA-copied data in the local buffer is somewhat volatile and the results vary with every iteration. The program in the attached file uses the aforementioned approach to scan a large area of memory chunk-by-chunk, counting the occurrences of the number 5. This is also tested with data from multiple NUMA nodes - in this case 0, 2, and 5. And, at least on our test system, the result varies with each iteration:
[Data on 0]
sequential I/OAT
result: 16384
result: 16074
result: 16388
result: 16335
result: 16168

[Data on 2]
sequential I/OAT
result: 1496
result: 16330
result: 16233
result: 16211
result: 16388

[Data on 5]
sequential I/OAT
result: 16081
result: 16150
result: 16104
result: 16211
result: 16066
The attached file could be compiled by just adding it as an executable target to the "ex4" directory of the blog post in the link above. Alternatively, it should compile with the following command line (given that the libraries and headers are available in the correct paths):
g++-7 -std=c++14 -fno-strict-aliasing -march=native -m64 -D_GNU_SOURCE -fPIC -fstack-protector -Wl,-z,relro,-z,now -Wl,-z,noexecstack ioat_inconsistencies.cpp -o bug_report -lnuma -lspdk_ioat -lspdk_util -lspdk_env_dpdk -lspdk_log -lrt -ldl -lrte_eal -lrte_mempool -lrte_ring -pthread
Can somebody confirm this inconsistent behavior on their machine? Or am I missing something about how the DMA memory works in general?
Thanks in advance for any help
In your test code, you are using copy_done[idx] as a flag to determine when the copy is complete. However, after the first two iterations of the chunk_count loop, both copy_done[0] and copy_done[1] will already be 1 by the time you call spdk_ioat_submit_copy(). The spdk_ioat_process_events() loop will therefore probably exit early, before the copy is actually done, and the scan then reads whatever data is still in the buffer from an earlier chunk.
Re-initializing copy_done[idx] to 0 before calling spdk_ioat_submit_copy() makes the test code work for me.
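To illustrate the effect without needing I/OAT hardware, here is a minimal sketch of the same pattern. Note that FakeCopyEngine, its latency parameter, and count_fives are hypothetical stand-ins I made up for this example; they are not part of SPDK and only mimic the submit-then-poll flow of spdk_ioat_submit_copy() and spdk_ioat_process_events(). The only piece taken from your code is the idea of a per-buffer copy_done flag.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <deque>
#include <vector>

// Hypothetical stand-in for an asynchronous copy engine (NOT the SPDK API).
// A submitted copy finishes only after it has been polled `latency` times.
class FakeCopyEngine {
public:
    explicit FakeCopyEngine(int latency) : latency_(latency) {}

    // Queue a copy; *done is set to true once the data has really been copied.
    void submit_copy(int* dst, const int* src, std::size_t count, bool* done) {
        pending_.push_back({dst, src, count, done, 0});
    }

    // Advance the oldest pending copy by one step (mimics polling for completions).
    void process_events() {
        if (pending_.empty()) return;
        Request& r = pending_.front();
        if (++r.polls >= latency_) {
            std::memcpy(r.dst, r.src, r.count * sizeof(int));
            *r.done = true;
            pending_.pop_front();
        }
    }

private:
    struct Request {
        int* dst;
        const int* src;
        std::size_t count;
        bool* done;
        int polls;
    };
    std::deque<Request> pending_;
    int latency_;
};

// Scan `data` chunk by chunk through two alternating buffers and count the 5s.
// With reset_flag == false the completion flag is never cleared, which
// reproduces the early-exit problem described above.
long count_fives(const std::vector<int>& data, std::size_t chunk, bool reset_flag) {
    assert(chunk > 0 && data.size() % chunk == 0);
    FakeCopyEngine engine(3);
    std::vector<int> buf[2] = {std::vector<int>(chunk), std::vector<int>(chunk)};
    bool copy_done[2] = {false, false};
    long total = 0;

    for (std::size_t i = 0; i < data.size() / chunk; ++i) {
        const std::size_t idx = i % 2;
        if (reset_flag) copy_done[idx] = false;  // the fix: clear the flag before submitting
        engine.submit_copy(buf[idx].data(), data.data() + i * chunk, chunk, &copy_done[idx]);
        do {
            engine.process_events();
        } while (!copy_done[idx]);  // a stale "true" ends this loop after a single poll
        for (int v : buf[idx]) total += (v == 5);
    }
    return total;
}
```

Calling count_fives(data, chunk, true) waits for every copy and counts each chunk exactly once. With false, the wait loop returns after one poll from the third chunk onward, so the scan runs over stale buffer contents and the count no longer matches the input. On real hardware the timing is not deterministic, which would explain why your results change from run to run.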