Looking to see if anyone can shed some light on this for me.
I'm a user of a real-time audio mixing application that is written in assembly and runs on Win32.
The developer says that he isn't able to make use of multicore systems beyond 2 cores due to the fact that threads running on one core will effectively stomp on the priority assignments of threads running on different cores.
This is his explanation regarding the problems he's facing:
Any thread on another core will stomp on any priority assignment of threads on different cores.
There appears to be no thread priority between cores. Any memory access on one core will shut down other cores. Threads on different cores that access RAM will force single-core operation during that time.
In a real world situation, threads in this application will need to process variables stored in memory all the time which is what causes problems on multi-core systems.
Threads running on separate cores essentially have the same priority; the CPU knows nothing about thread priorities, it just runs what the OS tells it to run. However, threads running on separate cores should not affect each other unless they are trying to access the same chunk of memory. If that is the case, then you will indeed see something like what you are experiencing: the cache line contention will cause the cores involved to stall while they wait on each other, so everything slows down.
The solution is not to limit the number of cores, but to change the program so that it is not accessing the shared memory so often. Maybe copy the data to each thread beforehand, so each thread is independent, or use some other mechanism to limit how often the threads access the shared memory.
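Here is a minimal sketch in C of the "copy the data to each thread beforehand" idea. The 64-byte cache line size, the block size, the `worker_slot` struct and the two function names are all assumptions made up for illustration; none of this comes from the application being discussed. Thread creation is left out on purpose. The point is only the data layout: each worker gets its own cache-line-aligned slot, reads the shared buffer once, and then touches nothing but its own slot.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_LINE 64   /* assumption: typical x86 cache line size */
#define BLOCK 256       /* hypothetical number of samples per worker */

/* One slot per worker thread. The alignment pads the struct to a multiple
   of the cache line size, so two slots never share a cache line. */
typedef struct {
    _Alignas(CACHE_LINE) int16_t samples[BLOCK]; /* private copy of the input */
    int32_t peak;                                /* private result */
} worker_slot;

/* Copy one worker's share of the shared buffer into its own slot.
   This is the only time the shared buffer is read. */
static void slot_load(worker_slot *slot, const int16_t *shared, size_t offset) {
    memcpy(slot->samples, shared + offset, sizeof slot->samples);
    slot->peak = 0;
}

/* The work a thread would run: it reads and writes only its own slot. */
static void slot_process(worker_slot *slot) {
    int32_t peak = 0;                 /* accumulate in a local, not in memory */
    for (size_t i = 0; i < BLOCK; i++) {
        int32_t v = slot->samples[i];
        if (v < 0) v = -v;
        if (v > peak) peak = v;
    }
    slot->peak = peak;                /* single write at the end */
}
```

In a real program each thread would call `slot_process` on its own slot, and the mixer would collect the `peak` values only after the threads have finished the block.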
You can use some simple techniques to reduce the number of reads/writes. E.g. raw data may come in as 10/12/14/16 bits from an ADC, delivered via 16-bit reads. Internally you may be able to process the data as 2x16 (32 bits), 4x16 (64 bits) or 8x16 (128 bits, SSE) units, thus reducing memory reads/writes. Additionally you could employ lossless packing:
3x10 bits in 32 bits, 6x10 bits in 64 bits, 5x12 bits in 64 bits
Or, depending on the data, you can store 8-bit deltas, with some loss of data for "pops" and "gaps" (although you could incorporate an escape code to extend the number of bits for the large swings).
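The 3x10-bits-in-32 packing mentioned above could look roughly like this in C. The function names and the bit layout (first sample in the low bits, top 2 bits unused) are my own assumptions, just one reasonable way to do it. Samples are treated as unsigned 10-bit values (0..1023), as they would typically come off an ADC.

```c
#include <stdint.h>

/* Pack three 10-bit samples into one 32-bit word.
   Layout (assumed): bits 0-9 = a, bits 10-19 = b, bits 20-29 = c,
   bits 30-31 unused. Values are masked to 10 bits. */
static uint32_t pack3x10(uint16_t a, uint16_t b, uint16_t c) {
    return  (uint32_t)(a & 0x3FF)
         | ((uint32_t)(b & 0x3FF) << 10)
         | ((uint32_t)(c & 0x3FF) << 20);
}

/* Recover the three samples from a packed word. */
static void unpack3x10(uint32_t w, uint16_t out[3]) {
    out[0] = (uint16_t)( w        & 0x3FF);
    out[1] = (uint16_t)((w >> 10) & 0x3FF);
    out[2] = (uint16_t)((w >> 20) & 0x3FF);
}
```

Three samples then cost one 32-bit memory access instead of three 16-bit ones, at the price of a few shifts and masks that stay in registers.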