For a memory bandwidth limited application, you can expect the total CPU time used by 2 cores to increase significantly over the time used in a serial run. Even in an ideal parallel situation, you are looking for a decrease in elapsed time, not a decrease in total CPU time. In this case, a small increase in elapsed time is possible, as you may be increasing the number of times some of the data pass between memory and cache.
If your outer k loop is such that the compiler could recognize it as a do-nothing loop, it might be optimized away to a different extent in an openmp vs non-openmp compilation. Normally, you would go through a non-optimizable function after each execution of the inner loop, to avoid that effect.
The following comments are based on my assumptions of what you are trying to do. My lack of understanding of your code may make my comments meaningless....
A good way to handle any multi-threaded problem is to eliminate (as much as possible) lock conditions.
Think of each of your hash table entries as having 3 states:
Has value previously computed
Computation in progress
Have each thread have a queue of pending operations (initialized to none pending)
Currently when your code "ponders" a position it computes the hash code of the position under consideration, consults the hash table, if the hash table entry for the position indicates no value then a call is made to produce the value (currently you are locking (wanting to lock)this entry until computation is complete), the entery is inserted into the hash table, endif.Here if/when hash table has result from prior computaton, The hashtable entryis used in some function toadvance towardssolution.
In your first attempt (lock per hash table cell), the 2nd thread attempting toconsider the same position would block until the thread producing the results completes. But you ran out of locks in the first attempt. In the second attempt it appears you resulted in both threads seeing the entry for the hash code as empty and both threads computed the result.
Try seeing if you can change the code such that the hash code table entries have three states (Never used, Has result, Pending). If Has result - use it, if Never used - use InterlockedCompareAndExchange to set state to Pending (and to detect if other thread beat you to setting pending) and compute result, if Pending - create entry into the new queue I mentioned used for reconsider position at alater time- then proceed to consider new position. Periodically, or as you unwind your recursion, look at the queue to see if there are any positions to reconsider.
If at all possible, code without using Mutex, critical sections, and other blocking code. Use the InterlockedXxxx collection of function calls.
Try amd.com and look for CodeAnalyst. Timer based sampling will run on Intel processors. They may have a Linux version (I use the Windows version).
You should be able to locate hot spots.