It's the fastest way, i.e. a direct write in one thread and spinning on a load in another thread. Yes, it costs hundreds of cycles, and there is nothing you can do about that (if you need physical movement of data).
There are only two options to speed it up: (1) schedule both threads on the same core (no physical concurrency in this case), or (2) batch messages: you can physically transfer up to 64 bytes (one cache line) for the same cost.
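To illustrate option (2), here is a minimal sketch of packing several messages into one 64-byte line so they all travel in a single transfer. The names (Batch, publish, try_consume) and the layout are hypothetical, and it assumes 64-byte cache lines as on typical x86 CPUs. It is a one-slot handoff only: the producer must not publish again until the consumer has read the batch, so a real queue would need an acknowledgement on top of this.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines, as on typical x86 CPUs.
constexpr std::size_t kCacheLine = 64;
constexpr std::size_t kWords = 7;  // 7 * 8 bytes of payload + 8 bytes of seq = 64

// Hypothetical batch layout: sequence counter and payload share one cache line,
// so the consumer gets the whole batch with a single line transfer.
struct alignas(kCacheLine) Batch {
    std::atomic<std::uint64_t> seq{0};  // bumped last, with release ordering
    std::uint64_t payload[kWords]{};    // the batched messages
};

static_assert(sizeof(Batch) == kCacheLine, "Batch must fill exactly one cache line");
static_assert(alignof(Batch) == kCacheLine, "Batch must not straddle two cache lines");

// Producer: write all messages, then publish them by bumping seq.
inline void publish(Batch& b, const std::uint64_t (&msgs)[kWords]) {
    for (std::size_t i = 0; i < kWords; ++i) b.payload[i] = msgs[i];
    b.seq.store(b.seq.load(std::memory_order_relaxed) + 1,
                std::memory_order_release);
}

// Consumer: returns true and copies the payload if a new batch is visible.
inline bool try_consume(Batch& b, std::uint64_t& last_seen,
                        std::uint64_t (&out)[kWords]) {
    const std::uint64_t s = b.seq.load(std::memory_order_acquire);
    if (s == last_seen) return false;  // nothing new yet
    for (std::size_t i = 0; i < kWords; ++i) out[i] = b.payload[i];
    last_seen = s;
    return true;
}
```

The consumer would call try_consume in its spin loop, starting with last_seen = 0; the cross-core cost is then paid once per batch instead of once per message.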
Btw, you should use the PAUSE instruction in spin loops instead of NOP.
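A minimal sketch of the write/spin pattern with PAUSE in the loop, assuming a single 64-bit slot where 0 means "empty". The helper names (cpu_relax, send, spin_receive) are made up for illustration; _mm_pause is the compiler intrinsic that emits PAUSE on x86, and on other architectures this sketch simply spins without a hint.

```cpp
#include <atomic>
#include <cstdint>
#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>  // _mm_pause
#define SKETCH_HAVE_PAUSE 1
#endif

// Hypothetical helper: emit PAUSE on x86, do nothing elsewhere.
inline void cpu_relax() {
#ifdef SKETCH_HAVE_PAUSE
    _mm_pause();
#endif
}

// Writer thread: a direct store to the shared slot.
inline void send(std::atomic<std::uint64_t>& slot, std::uint64_t value) {
    slot.store(value, std::memory_order_release);
}

// Reader thread: spin on a load until the slot no longer holds `empty`.
inline std::uint64_t spin_receive(const std::atomic<std::uint64_t>& slot,
                                  std::uint64_t empty = 0) {
    std::uint64_t v;
    while ((v = slot.load(std::memory_order_acquire)) == empty) {
        cpu_relax();  // PAUSE instead of NOP inside the spin loop
    }
    return v;
}
```

One thread would call send(slot, value) while the other sits in spin_receive(slot); the sentinel value 0 is an assumption of this sketch and must never be sent as a real message.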