Handling Branch Predictor and Instruction Cache Misses in Non-Deterministic Parallel Programming

Bryan_K_1 · 04-08-2016

Hi.

I've been working on a library that implements mostly transparent automatic parallelization; however, when I benchmark it, I only see about 50% CPU utilization across all cores. System resource monitoring shows main memory traffic of ~600 MB/s. I was able to raise CPU usage to about 70% (and also greatly improve performance) by using GCC's __builtin_prefetch(), but main memory traffic still sits at ~600 MB/s.
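For context, here is a minimal sketch of how I understand the builtin. Its signature is __builtin_prefetch(addr, rw, locality), where rw is 0 for a read and locality runs from 0 (no temporal locality) to 3 (keep in all cache levels). The helper names and the summing loop are hypothetical and only there for illustration. The instruction names in the comments are my understanding of what GCC emits on x86, not something I have confirmed for every compiler version.

```cpp
#include <cstddef>

// Hypothetical wrappers, one per locality level. The rw and locality
// arguments must be compile-time constants.
inline void prefetch_nta(const void* p) { __builtin_prefetch(p, 0, 0); }  // assumed: PREFETCHNTA
inline void prefetch_t2(const void* p)  { __builtin_prefetch(p, 0, 1); }  // assumed: PREFETCHT2
inline void prefetch_t1(const void* p)  { __builtin_prefetch(p, 0, 2); }  // assumed: PREFETCHT1
inline void prefetch_t0(const void* p)  { __builtin_prefetch(p, 0, 3); }  // assumed: PREFETCHT0

// Sums an array while hinting the element `distance` slots ahead.
// The prefetch is only a hint, so the result is the same with or without it.
long sum_with_prefetch(const long* data, std::size_t n, std::size_t distance) {
    long total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + distance < n) {
            prefetch_t2(data + i + distance);
        }
        total += data[i];
    }
    return total;
}
```

The locality value of 1 matches what I pass in the dispatcher loop further down. Whether a given distance helps or hurts has to be measured on the target machine.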

As with all parallel code, I used gprof to generate a call graph with time percentages, and the majority of the overhead comes from pointer indirection in a single spot (https://github.com/bk5115545/The-Event-Engine/blob/EngineLogging/include/event_system/Subscriber.h#L74). I have verified that the target function optimizes to a nop at -O3, so I have included the code below to attempt to prefetch the instructions into the instruction cache, but I can't find any assembly directives that let me hint which memory should be loaded into the instruction cache when disparate code paths are about to be executed.

Windows performance counters also show about a 60% L2 hit rate, which, based on what I've been reading, is very low.

https://github.com/bk5115545/The-Event-Engine/blob/EngineLogging/source/event_system/Dispatcher.cpp#L215-L236

for (unsigned int i = 0; i < thread_cache.size(); i++) {
    try {
        // std::cerr << "Thread try_call try." << std::endl;
        std::pair<Subscriber*, std::shared_ptr<void>>& work = thread_cache.at(i);

        // Pull the first cache line (between 32 and 64 bytes) of the next function into the L2 cache,
        // which is HOPEFULLY faster than referencing main memory. It would be a lot nicer if x86 had
        // instructions to prefetch into the instruction cache for rare but time-sensitive code paths.
        // The bound is size(), not size() - 1, so the last item's target is prefetched as well.
        if (i + 1 < thread_cache.size())
            __builtin_prefetch((thread_cache.at(i + 1).first->target_for_prefetch()), 0, 1);

        work.first->call(work.second);
        // std::cerr << "Thread try_call success." << std::endl;
    } catch (const std::string& e) {
        // Catch by const reference to avoid copying the thrown string.
        std::cerr << "Exception thrown by function called by Dispatcher Threads." << std::endl;
        std::cerr << e << std::endl;
    }
}
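
To make the loop easier to try in isolation, here is a minimal self-contained sketch of the same pattern. The Subscriber struct below is a hypothetical stand-in that models only the two members the loop uses; it is not the real class from the repository. The run_batch, count_call and g_calls names are also made up for illustration. Casting a function pointer to const void* is only conditionally supported by the standard, though GCC and Clang accept it.

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for the real Subscriber class.
struct Subscriber {
    void (*fn)(std::shared_ptr<void>);
    // Address of the code that call() will jump to.
    const void* target_for_prefetch() const { return reinterpret_cast<const void*>(fn); }
    void call(std::shared_ptr<void> arg) const { fn(std::move(arg)); }
};

using WorkItem = std::pair<Subscriber*, std::shared_ptr<void>>;

// Illustrative callback that just counts how often it runs.
static int g_calls = 0;
void count_call(std::shared_ptr<void>) { ++g_calls; }

// Runs every queued item, hinting the next item's target before each call.
// Returns the number of prefetch hints issued (size() - 1 for a non-empty batch).
std::size_t run_batch(const std::vector<WorkItem>& thread_cache) {
    std::size_t hints = 0;
    for (std::size_t i = 0; i < thread_cache.size(); ++i) {
        if (i + 1 < thread_cache.size()) {
            // rw = 0 (read), locality = 1 (low temporal locality).
            __builtin_prefetch(thread_cache[i + 1].first->target_for_prefetch(), 0, 1);
            ++hints;
        }
        thread_cache[i].first->call(thread_cache[i].second);
    }
    return hints;
}
```

With three queued items this issues two hints and makes three calls. Note that this is still a data prefetch of the bytes at the function's address, so it does not by itself place anything in the instruction cache.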

I have a few questions. First, is there a way (even a vendor-specific one) to hint to the cache engine that a rarely called function will be called soon? Second, is there an Intel processor on the market that allows complete software control of the cache? (I know that the cache is shared, but I will need this for the next phase of my project.) Third, I'm trying to test the feasibility of a new type of architecture using software emulation. I'm currently in college; what is this type of work called?

Thanks for your help

Bryan_K_1 · 04-13-2016

Is there anyone here familiar with explicit caching using the IA-32 or Intel 64 architectures?

I was reading more of Volume 3 of the "Intel 64 and IA-32 Architectures Software Developer's Manual" and found a section stating that the PREFETCHh instruction should never be used to fetch code; however, my program above gets about a 15% performance increase when using GCC's __builtin_prefetch (which compiles to PREFETCHh with some added checks).

What's the reason for this limitation, since the instruction appears to do something close to what I want? Does it sidestep the TLB and force a retranslation of the virtual address space, or is it something more time-consuming?