Welcome to the Threading Forums!
Your mention of the write-combining buffers reminds me of some code I'd seen previously where this was a problem. In that case, the code was restructured to reduce the total number of buffers needed within the code segment. It was a loop containing several independent operations, so breaking the single loop into several smaller loops was relatively straightforward, and the results were pretty astounding. I'm not sure whether this would be applicable in your case.
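To give a rough idea of the kind of restructuring I mean, here is a minimal sketch in C. It is not the original code; the function and array names are made up for illustration, and the right number of output streams per loop depends on how many write-combining buffers your particular processor has.

```c
#include <stddef.h>

/* Fused version: every iteration writes to four separate output arrays,
   so four write streams are active at the same time. */
void scale_fused(size_t n, const float *src,
                 float *out_a, float *out_b, float *out_c, float *out_d)
{
    for (size_t i = 0; i < n; i++) {
        out_a[i] = src[i] * 2.0f;
        out_b[i] = src[i] + 1.0f;
        out_c[i] = src[i] - 1.0f;
        out_d[i] = src[i] * 0.5f;
    }
}

/* Split version: the same independent operations, divided into two loops
   so that only two write streams are active in each loop. */
void scale_split(size_t n, const float *src,
                 float *out_a, float *out_b, float *out_c, float *out_d)
{
    for (size_t i = 0; i < n; i++) {
        out_a[i] = src[i] * 2.0f;
        out_b[i] = src[i] + 1.0f;
    }
    for (size_t i = 0; i < n; i++) {
        out_c[i] = src[i] - 1.0f;
        out_d[i] = src[i] * 0.5f;
    }
}
```

Both versions produce the same results. The split version reads src twice, so whether it wins is something you would have to measure on your own code.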
One caution I should point out, though: be sure that you are using the physical processors on the system. With one physical processor, this is not a problem. However, if you have 2 physical processors with HT turned on (4 logical processors), you still run the risk that the two threads created will end up on the same physical processor, and then you've got the same problem as before. One would hope that the OS would schedule the threads better than that.
You still might consider setting thread affinity to give a better chance of running two threads on separate physical CPUs.
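As a sketch of what setting affinity could look like, here is one way to do it in C. This assumes Linux with glibc, which may not match your system (on Windows the corresponding call would be SetThreadAffinityMask). The helper names pin_calling_thread and first_allowed_cpu are mine, just for illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* needed for cpu_set_t, CPU_SET and sched_setaffinity */
#endif
#include <sched.h>

/* Restrict the calling thread to a single logical CPU.
   Returns 0 on success, -1 on failure (errno is set by the system call). */
int pin_calling_thread(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);   /* 0 = calling thread */
}

/* Return the lowest-numbered logical CPU the calling thread is currently
   allowed to run on, or -1 if the affinity mask cannot be read. */
int first_allowed_cpu(void)
{
    cpu_set_t set;
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &set))
            return cpu;
    }
    return -1;
}
```

Each of your two threads would call pin_calling_thread with a different logical CPU number at startup. Which logical CPU numbers sit on separate physical processors varies from system to system, so check that first; on Linux the "physical id" field in /proc/cpuinfo is one place to look.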