Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

How to avoid cache poisoning?

arandomguy
Beginner
1,021 Views
I'm trying to parallelize a sorting algorithm. It works fine most of the time, and I've thought I'd found a solution several times, but it seems I always have some chance of error no matter what I do.

To simplify a bit, I partition the array in place. Then I send off the partitioning of each side to a job queue, where it will be picked up and executed. The process is iterative, so each job will spawn two more jobs until a certain limit is reached, at which point another sort type kicks in.

so it's something like

Job 1
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

Job 2                          Job 3
aaaaaaaaaaaaaaaaaaaaaaaaaaa    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

Job 4      Job 5               Job 6              Job 7
aaaaaaaa   aaaaaaaaaaaaaaaa    aaaaaaaaaaaaaaaaa  aaaaaaaaaaaaaaaaaaaaaa

and so on.
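
Roughly, each job does something like the following. This is only a simplified sketch, not my real code: the enqueue callback, the cutoff value, and the use of std::partition are stand-ins for the actual job queue and partition step.

```cpp
// Simplified sketch only -- the real partition step and job queue differ.
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

typedef std::function<void(std::function<void()>)> Enqueue; // stand-in for the job queue

const std::ptrdiff_t kCutoff = 500; // below this, fall back to a serial sort

void sort_job(std::vector<int>& a, std::ptrdiff_t lo, std::ptrdiff_t hi, Enqueue enqueue)
{
    if (hi - lo <= kCutoff) {
        std::sort(a.begin() + lo, a.begin() + hi);      // serial fallback
        return;
    }

    // Partition in place.  Three-way split around the pivot so runs of
    // equal elements don't stall the recursion.
    int pivot = a[lo + (hi - lo) / 2];
    auto mid1 = std::partition(a.begin() + lo, a.begin() + hi,
                               [pivot](int x) { return x < pivot; });
    auto mid2 = std::partition(mid1, a.begin() + hi,
                               [pivot](int x) { return !(pivot < x); });
    std::ptrdiff_t m1 = mid1 - a.begin();
    std::ptrdiff_t m2 = mid2 - a.begin();

    // Children are queued only after all of the parent's writes are done,
    // and the two child ranges never overlap.
    enqueue([&a, lo, m1, enqueue] { sort_job(a, lo, m1, enqueue); });
    enqueue([&a, m2, hi, enqueue] { sort_job(a, m2, hi, enqueue); });
}
```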

It's plain to see that none of the jobs ever interfere with each other. The only job that can possibly write to the same area as a given job is its parent, and it never spawns the children until the partitioning is complete. After extensive testing, I am "sure" that this is always true and that there's no other odd flaw cropping up. So the only thing I can think of is that the cache coherency is shot and I need to find some way to force the CPU to write out its cache before it spawns its children.

As far as I can tell, what I need to do is create a memory fence, but it has to be at the CPU level, not at the compiler level. I've been searching for how to do this to no avail - lots of sources point out that this is what I need to do, but they don't explain how.

Or, if that's not the way to do it, what is? I have come across a few instructions to flush the cache, but they all seem to be privileged instructions, so they don't do me much good. This seems like something vital for any nonblocking parallel code, so there must be some simple answer, but for the life of me I can't find one.
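
For clarity, this is the distinction I mean between the two levels. The names below are made up for illustration (not my code), using standard MSVC/SSE2 intrinsics:

```cpp
#include <emmintrin.h>   // _mm_mfence, _mm_sfence, _mm_lfence
#include <intrin.h>      // _ReadWriteBarrier (MSVC only)

void publish(volatile int* data, volatile int* ready)
{
    *data = 42;

    _ReadWriteBarrier(); // compiler-level: stops the compiler from reordering
                         // across this point, but emits no instruction at all

    _mm_mfence();        // CPU-level: a full hardware fence ordering loads and
                         // stores (_mm_sfence orders stores only)

    *ready = 1;          // another thread that sees ready == 1 should also
                         // see data == 42
}
```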






7 Replies
arandomguy
Beginner
Well, if I place a lock around the swap it works fine, but obviously that's very slow.

I tried using __asm mfence; in the same place, but it doesn't fix the problem. It does make things much slower, so I know it's executing, but it's ineffectual.

Is there some more current instruction I need? It seems completely crazy that Windows has nothing built in for this, but it seems its memory barrier really only affects the compiler.

Perhaps this result means that it's the compiler doing this to me, though? Is that even possible in debug mode? I have no optimizations turned on whatsoever.
arandomguy
Beginner
OK, my problem has worked itself out. It turns out the fence was working correctly (and I finally found out how to do it through VC++, to boot). The problem was that I had accidentally deleted a lock on my queueing system, and once in a blue moon it would hand off the same job to two threads and thereby skip another job.
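
In case it helps anyone else who lands here, the failure mode was essentially an unlocked pop from the job queue. Roughly this shape (simplified; std::mutex and the types below are stand-ins for what my queue actually uses), where the fix is just the lock:

```cpp
#include <deque>
#include <functional>
#include <mutex>

class JobQueue {
    std::deque<std::function<void()> > jobs_;
    std::mutex lock_;
public:
    void push(std::function<void()> job) {
        std::lock_guard<std::mutex> guard(lock_);
        jobs_.push_back(job);
    }
    // Without the lock below, two worker threads can both read the same
    // front element before either pops it -- the same job runs twice and
    // another job is effectively skipped.
    bool try_pop(std::function<void()>& job) {
        std::lock_guard<std::mutex> guard(lock_);
        if (jobs_.empty()) return false;
        job = jobs_.front();
        jobs_.pop_front();
        return true;
    }
};
```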
Tom_Spyrou
Beginner


Debugging this sort of problem is always frustrating...

robert-reed
Valued Contributor II
Quoting - arandomguy
OK, my problem has worked itself out. It turns out the fence was working correctly (and I finally found out how to do it through VC++, to boot). The problem was that I had accidentally deleted a lock on my queueing system, and once in a blue moon it would hand off the same job to two threads and thereby skip another job.

Ummm..., are you aware of Intel's Threading Building Blocks and the parallel_sort method it contains? It sounds like that implements a quicksort-style partitioning algorithm similar to what you have been experimenting with. In algorithms that force you to touch everything, like the quicksort partition step, there's no way to avoid cache poisoning, but you can ameliorate the problem by using parallel sort only for large sets, where there's sufficient work to allow the partitions to drift to the HW threads assigned to their further processing. I think the grain size for switching between parallel partitioning and serial sorting in the TBB parallel sort is up around 500 elements or so.
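
Using it is essentially a one-liner, along these lines (shown with the classic tbb/parallel_sort.h header spelling):

```cpp
#include <tbb/parallel_sort.h>
#include <vector>

void sort_big(std::vector<float>& v)
{
    // Partitions recursively in parallel and falls back to a serial sort
    // for small sub-ranges.
    tbb::parallel_sort(v.begin(), v.end());
}
```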
catroz
Beginner


Well, I have had a similar problem with my cache getting poisoned somehow, and I was completely mad about it. I don't know what I did; I messed around with it for a bit, and one night, after struggling and getting myself a huge headache, I went to bed. The next afternoon when I woke up (yes, afternoon - the headache was a huge one) I turned it on and hey, it was solved. I believe it was the first miracle I have witnessed.


Catroz
Http://baixaquigratis.com
jimdempseyatthecove
Honored Contributor III

On your lowest level of partitioning, add an _mm_sfence(); after exiting the outer iteration loop and making the last write, but before you exit your Job-n (or before setting a Job-n done semaphore). Without seeing your code, it is difficult for us to make anything other than a generalized statement.
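
In outline, something like this (the job body and the done flag are placeholders, since we haven't seen your code):

```cpp
#include <emmintrin.h>   // _mm_sfence

void partition_job(int* a, int lo, int hi, volatile long* done)
{
    // ... in-place partition of a[lo..hi) -- the last writes happen here ...

    _mm_sfence();    // ensure all of the stores above are globally visible
                     // before any other thread can see this job as finished

    *done = 1;       // or release the Job-n done semaphore / queue the children
}
```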
mahmoudgalal1985
Beginner

Quoting - jimdempseyatthecove
On your lowest level of partitioning, add an _mm_sfence(); after exiting the outer iteration loop and making the last write, but before you exit your Job-n (or before setting a Job-n done semaphore). Without seeing your code, it is difficult for us to make anything other than a generalized statement.
Thank you