In current software, a lot of every application's time is spent walking maps and sets (besides arrays, these are the most commonly used data structures). I think this is a place where the CPU could provide enormous acceleration with a specialized design and instructions for these data structures. Here is one idea of how to do it:
1) "HW map Unit" consisting of, say, 10.000 64-bit integer triplets (MapID, key, value). Application allocates map/set of size X, and if there is enough free places in HW map, new MapID is allocated for application. Triplets are then populated by key/values (or only keys, in case of sets). In case there is no enough triplets, application resorts to classical SW implementation.
2) A lookup in the HW map/set can search over all keys in parallel (a comparator implemented for each MapID/key pair), providing an incredible speedup. It also doesn't need to use cache memory, because it is a sort of cache memory itself, so all other software on the machine gains the cache capacity that a SW map/set would otherwise occupy.
3) To begin with, this map and set could be used inside the (Linux?) OS alone. Using it in, say, Java (Android), .NET, or iOS would be harder, because maps and sets in these frameworks need to support ALL the standard interfaces, iterators, and functions, which could be challenging to implement in a library wrapping the HW map/set unit.
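To make the fallback path in point 1 concrete, here is a minimal sketch of how an application-facing wrapper could look. The `hw_map_alloc`/`hw_map_insert`/`hw_map_lookup` intrinsics are entirely hypothetical names for the proposed unit; since no such hardware exists, they are stubbed here to always report "no free triplets", which exercises the software fallback:

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>

// Hypothetical intrinsics for the proposed HW map unit (invented names).
// Stubbed to always fail, i.e. the unit is "full", forcing the SW path.
inline int hw_map_alloc(std::size_t /*capacity*/) { return -1; }  // -1: no MapID free
inline void hw_map_insert(int /*map_id*/, std::uint64_t, std::uint64_t) {}
inline std::optional<std::uint64_t> hw_map_lookup(int /*map_id*/, std::uint64_t) {
    return std::nullopt;
}

// A map that prefers the HW unit when a MapID could be allocated,
// and otherwise degrades to a classical software hash map.
class HybridMap {
public:
    explicit HybridMap(std::size_t capacity) : map_id_(hw_map_alloc(capacity)) {}

    void insert(std::uint64_t key, std::uint64_t value) {
        if (map_id_ >= 0) { hw_map_insert(map_id_, key, value); return; }
        sw_map_[key] = value;  // classical SW implementation
    }

    std::optional<std::uint64_t> lookup(std::uint64_t key) const {
        if (map_id_ >= 0) return hw_map_lookup(map_id_, key);
        auto it = sw_map_.find(key);
        return it == sw_map_.end() ? std::nullopt
                                   : std::optional<std::uint64_t>(it->second);
    }

private:
    int map_id_;  // valid MapID, or -1 if the HW unit had no free triplets
    std::unordered_map<std::uint64_t, std::uint64_t> sw_map_;
};
```

Note the fallback decision is made once at allocation time; a real design would also have to decide what happens when a HW-resident map later outgrows its triplet budget.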
- Intel® Advanced Vector Extensions (Intel® AVX)
- Intel® Streaming SIMD Extensions
- Parallel Computing
Content Addressable Memories are probably too power-hungry and large to be a standard processor feature, and (as noted above) can be very slow to emulate in software, but this might be an interesting use case for an on-die FPGA.
A properly constructed hash map, or a set of them, might produce what you want today.
The issues you face in defining the hash key for a hash table are the same as in deciding which part of the data set serves as the addressable content. And while software hash tables generally handle collisions, that may not be the case with hardware content-addressable memory.
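To illustrate the collision point: a software hash table keeps working even under the worst possible hash. The sketch below plugs a deliberately degenerate hash function (every key lands in bucket 0) into `std::unordered_map`; chaining still keeps every key individually retrievable, an escape hatch a fixed-size CAM entry does not have:

```cpp
#include <string>
#include <unordered_map>

// Deliberately degenerate hash: every key collides in bucket 0.
struct ConstantHash {
    std::size_t operator()(const std::string&) const { return 0; }
};

using CollidingMap = std::unordered_map<std::string, int, ConstantHash>;

inline CollidingMap make_colliding_map() {
    CollidingMap m;
    m["alpha"] = 1;  // all three keys hash to the same bucket...
    m["beta"]  = 2;
    m["gamma"] = 3;  // ...yet chaining resolves them all correctly
    return m;
}
```

Lookups in this table degrade to a linear scan of the bucket's chain, but they stay correct, which is the property at issue here.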
I think such an extension, as proposed in the OP, would have very limited usefulness. Real-world code is rarely as simple as std::map<int, int>: more complex key/value types, as well as custom ordering predicates, are common. Not to mention that there are other associative containers, tree-based and not.
What could be improved, though, is memory prefetching. AFAIK, on current CPUs the memory accesses during a binary lookup in a tree are mostly unpredictable and therefore suffer from memory stalls. This is even more the case when keys are complex (e.g. strings). Some hinting might help prefetch the candidate next nodes while the ordering predicate is being applied. The current prefetch instructions are supposed to help in this respect, but unfortunately they are too difficult to use and unreliable.
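As a sketch of the kind of hinting meant here, the function below does a binary search over a sorted array (a tree flattened into contiguous storage) and issues `__builtin_prefetch` (a GCC/Clang builtin) for the midpoints of *both* candidate sub-ranges before the comparison resolves: one of the two will be the next probe whichever way the comparison goes. Whether this actually wins anything depends heavily on the microarchitecture:

```cpp
#include <cstddef>
#include <vector>

// Binary search with speculative prefetch of both possible next probes.
inline bool prefetching_search(const std::vector<int>& v, int key) {
    std::size_t lo = 0, hi = v.size();
    while (lo < hi) {
        std::size_t mid = lo + (hi - lo) / 2;
        // Midpoints of the two candidate sub-ranges [lo, mid) and
        // [mid+1, hi); exactly one of them is the next probe.
        std::size_t left_mid  = lo + (mid - lo) / 2;
        std::size_t right_mid = mid + 1 + (hi - mid - 1) / 2;  // <= v.size()
        __builtin_prefetch(v.data() + left_mid);
        __builtin_prefetch(v.data() + right_mid);  // prefetch never faults
        if (v[mid] == key) return true;
        if (v[mid] < key)  lo = mid + 1;
        else               hi = mid;
    }
    return false;
}
```

Half of these prefetches are wasted by construction, which is exactly the "difficult to use" trade-off the answer complains about: the hint has to be issued before the branch direction is known.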
One of the problems with prefetching is that the cache system typically relies on TLBs (Translation Lookaside Buffers). If (when) your linked-list or tree traversal exceeds the limited capacity of the TLBs (at each cache level), then a cache miss will also require a TLB entry to be replaced (in addition to fetching the data that was missed). This event requires additional memory fetches to load the appropriate page table entries. Depending on the core's memory controller, it may refuse to perform the prefetch if the page table entry (or a portion of it) is missing from the TLB. Some of this extra overhead can be changed (for better or worse) by changing the page size from 4 KiB to 2/4 MiB. YMMV.
If your requirements force you to use linked lists, then consider splitting each node into two parts: one part for the link and key, and a second part for the remainder of the data. Allocate one blob (slab) as a private pseudo-heap for the links and keys, and a second blob (slab) as a private pseudo-heap for the remainder of each node's data. The same index into the two slabs ties the pieces together.
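A minimal sketch of that two-slab layout (the type and field names are mine): the hot part (link + key) is packed densely in one slab, so a traversal touches far fewer cache lines and TLB pages, while the bulky payload sits in a parallel slab and is only touched on a key match:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Slab 1 entry: just the link and the key -- the data the traversal reads.
struct HotPart {
    std::int32_t  next;  // index of the next node in the list, -1 = end
    std::uint64_t key;
};

struct SplitList {
    std::vector<HotPart>     hot;      // slab 1: links + keys, densely packed
    std::vector<std::string> payload;  // slab 2: bulky data, same index
    std::int32_t head = -1;

    void push_front(std::uint64_t key, std::string data) {
        hot.push_back({head, key});          // entry i in slab 1...
        payload.push_back(std::move(data));  // ...pairs with entry i in slab 2
        head = static_cast<std::int32_t>(hot.size()) - 1;
    }

    // Traversal reads only the hot slab until the key matches.
    const std::string* find(std::uint64_t key) const {
        for (std::int32_t i = head; i != -1; i = hot[i].next)
            if (hot[i].key == key) return &payload[i];
        return nullptr;
    }
};
```

With 12-byte hot entries, a 64-byte cache line covers several keys per fill, whereas a conventional node dragging its payload along might spend a whole line (and a TLB reach) per key compared.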