To simplify a bit: I partition the array in place, then submit the partitioning of each side to a job queue where it will be picked up and executed. The process is iterative, so each job spawns two more jobs until a certain size limit is reached, at which point another sort type kicks in.
so it's something like

            Job1
        Job2    Job3
    Job4 Job5  Job6 Job7

and so on.
It's plain to see that none of the jobs ever interfere with each other. The only job that can possibly write to the same area as a given job is its parent, and it never spawns the children until the partitioning is complete. After extensive testing I am "sure" this is always true and that there's no other odd flaw cropping up. So the only thing I can think of is that cache coherency is shot and I need some way to force the CPU to write out its cache before it spawns the children.
As far as I can tell, what I need to do is create a memory fence but it has to be at the CPU level, not at the compiler level. I've been searching for how to do this to no avail - lots of sources point out this is what I need to do, but don't explain how.
Or, if that's not the right approach, what is? I have come across a few instructions to flush the cache, but they all seem to be privileged instructions, so they don't do me much good. This seems like something vital for any nonblocking parallel code, so there must be some simple answer, but for the life of me I can't find one.
I tried putting __asm mfence; in the same place, but it doesn't fix the problem. It does make everything much slower, so I know it's executing, but it's ineffectual.
Is there some more current instruction I need? It seems completely crazy that Windows has nothing built in for this, but its memory barrier seems to only affect the compiler.
Perhaps this result means that it's the compiler doing this to me, though? Is that even possible in debug mode? I have no optimizations turned on whatsoever.
Umm... are you aware of Intel's Threading Building Blocks and the parallel_sort method it contains? It sounds like that implements a quicksort-style partitioning algorithm similar to what you have been experimenting with. In algorithms that force you to touch everything, like the quicksort partition step, there's no way to avoid cache poisoning, but you can ameliorate the problem by using parallel sort only for large sets, where there's sufficient work to allow the partitions to drift to the HW thread assigned to their further processing. I think the grain size for switching between parallel partitioning and serial sorting in the TBB parallel sort is up around 500 elements or so.
On your lowest level of partitioning, add an _mm_sfence(); after exiting the outer iteration loop, after making the last write but before you exit your Job-n (or before setting a Job-n done semaphore). Without seeing your code it is difficult for us to make anything other than a generalized statement.