This may seem a little out of context, but here is a thought:
Texas Instruments DSPs and many other performance-oriented (embedded) CPU designs allow the programmer to dedicate part of the CPU cache as addressable "local memory", which is much faster than global memory.
Memory bandwidth becomes a serious problem as the core count increases (especially for linear algebra, which depends heavily on memory bandwidth).
This "local memory" is also present on GPUs. Being able to configure 12 MB of the L3 cache on a Xeon CPU, out of a total of 24 MB, as "local memory" would feel like milk and honey as far as the eye can see, especially considering that in most of today's high-end CPU core designs this local memory is less than 128 KB.
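To put rough numbers on the bandwidth problem, here is a minimal roofline-style sketch in Python. The per-core peak, the memory bandwidth and the flops-per-byte figure are made-up illustrative values (assumptions, not measurements of any real CPU), and the function name is just for this example.

```python
def attainable_gflops(cores, peak_gflops_per_core, mem_bw_gbs, flops_per_byte):
    """Roofline-style estimate: the lower of the compute and bandwidth limits."""
    compute_limit = cores * peak_gflops_per_core    # grows with core count
    bandwidth_limit = mem_bw_gbs * flops_per_byte   # fixed by the shared memory bus
    return min(compute_limit, bandwidth_limit)

# Hypothetical machine and kernel (illustrative values only).
PEAK_PER_CORE = 10.0   # GFLOP/s per core
MEM_BW = 40.0          # GB/s, shared by all cores
INTENSITY = 0.5        # flops per byte moved to/from global memory

for cores in (1, 2, 4, 8):
    gflops = attainable_gflops(cores, PEAK_PER_CORE, MEM_BW, INTENSITY)
    print(f"{cores} cores: {gflops:.1f} GFLOP/s attainable")
```

With these assumed numbers the estimate stops growing once the bandwidth limit is reached, no matter how many cores are added. Serving more of the working set from fast on-chip memory effectively raises the flops-per-byte figure seen by global memory, which is the point of the request.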
The feature request to Intel is therefore:
Please consider allowing the programmer to use a user-definable amount of L1, L2 or L3 cache as "local memory", and thus make the hardware platform reconfigurable for various types of loads. This would also enable the introduction of AVX v2 (doubling the register width), whose performance would scale nearly linearly with the number of cores, with much less dependence on global RAM.
OK, I know :) But we can dream, and maybe you can give it some thought.
I'm not sure x86 processors have support for doing that. On the PowerPC there was support, and you needed to configure various registers (including a base address register for that memory) to use some of the L2 or L3 cache as "private memory". Accesses to this memory would not go through the same-level cache or onto the memory bus. I don't think it would be possible without either that kind of feature or the ability to "lock" some cache lines in a specific level of cache.
If it is possible in the hardware, it would be a useful feature to be able to specify how much you need, instead of letting the runtime select an amount and report it back through the query interface.