Appreciate your time. I am working on updating some computationally intensive C code I wrote 12 years ago. As many people have pointed out, "the free lunch is over."
My objects (will) contain 1x int64, 9x __m128, 8x double, and 1x bool.
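For concreteness, a rough sketch of the layout I have in mind (field names are made up, and this assumes an x86 target where __m128 carries a 16-byte alignment requirement, which forces the whole struct to 16-byte alignment):

```cpp
#include <cstdint>
#include <xmmintrin.h>

// Hypothetical layout sketch, not the actual object.  The __m128
// members require 16-byte alignment, so any heap allocation of these
// objects must come from a 16-byte-aligned source or aligned SSE
// loads/stores (_mm_load_ps etc.) will fault.
struct SimObject {
    std::int64_t key;       // 8 bytes; the sort/lookup key
    __m128       state[9];  // 9 x 16 = 144 bytes, 16-byte aligned
    double       params[8]; // 64 bytes
    bool         alive;     // 1 byte + padding up to the struct's alignment
};

static_assert(alignof(SimObject) >= 16, "SSE members need 16-byte alignment");
```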
The algorithm is a fairly typical numerical simulation: a) an initial setup, b) millions of lookups and calculations between those objects, c) a few additions and deletions, d) a re-sort of the array (if I am using a sorted vector), and e) back to step b. I've looked at std::multiset and boost::flat_multiset, and right now I am also working with a simple sorted std::vector, which actually works quite nicely using tbb::parallel_sort for the update.
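For the sorted-vector variant, one iteration of steps b) through d) looks roughly like this (a minimal sketch with a simplified element type; tbb::parallel_sort takes the same iterator/comparator arguments as the std::sort shown, so it is a drop-in replacement on that line):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Elem {
    std::int64_t key;
    double       value;  // stand-in for the real payload
};

// Keep the vector ordered by key so lookups can use binary search.
inline bool byKey(const Elem& a, const Elem& b) { return a.key < b.key; }

// One pass of steps b)-d): look up, calculate, append, re-sort.
void step(std::vector<Elem>& v) {
    // b) lookups: lower_bound returns the first element with key >= 42
    Elem probe{42, 0.0};
    auto it = std::lower_bound(v.begin(), v.end(), probe, byKey);
    if (it != v.end() && it->key == probe.key)
        it->value *= 2.0;              // some calculation on the hit

    // c) a few additions (deletions would use std::remove_if + erase)
    v.push_back(Elem{7, 1.0});

    // d) restore the ordering; with TBB this line becomes
    //    tbb::parallel_sort(v.begin(), v.end(), byKey);
    std::sort(v.begin(), v.end(), byKey);
}
```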
I'm running on an Intel i7 machine and have been able to structure my code so I can use parallel_for safely. I am getting a speedup of about 4x, which flattens out after 5 threads are in use.
Right now about 35% of the time is spent in the actual calculations, hence my interest in the SSE calls; I see great potential there. Another 35% is spent in the lower_bound function looking up elements, the majority of which should fall within the slice of the range that the current parallel_for task is working on. I'm guessing that the basic lower_bound function is jumping all over memory in its binary search.
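One idea I am toying with to reduce those cache misses (a sketch, names made up): keep a parallel vector holding only the 8-byte keys and binary-search that instead. The search then walks a dense array (8 keys per 64-byte cache line) rather than striding across the full multi-hundred-byte objects, at the cost of keeping the two vectors in the same order after every re-sort.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Keys-only side index: keys[i] is the key of the i-th object in the
// (separately stored) object vector.  Both must be sorted identically.
struct KeyIndex {
    std::vector<std::int64_t> keys;

    // Index of the first element with key >= k (same contract as
    // std::lower_bound), usable to subscript the object vector.
    std::size_t lowerBound(std::int64_t k) const {
        return static_cast<std::size_t>(
            std::lower_bound(keys.begin(), keys.end(), k) - keys.begin());
    }
};
```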
The container needs to look up elements insanely fast by a key value, work with the __m128 Intel SSE calls, work with Intel tbb::parallel_for, and run on an Intel i7 machine.
Is it fair to summarize that if I am going to use the SSE* calls, I cannot use the standard library containers unless I use your trick with pointers mentioned above? And that any containers I do write need careful attention to memory allocation etc. to work optimally with the TBB parallel functions?
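For reference, the other route I have been considering instead of raw pointers is giving std::vector a 16-byte-aligned allocator, so the container itself hands back storage that is safe for aligned SSE access. A rough sketch of what I mean, using _mm_malloc/_mm_free from the SSE headers (minimal C++11 allocator, not production code):

```cpp
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>
#include <xmmintrin.h>  // _mm_malloc / _mm_free

// Minimal allocator returning Align-byte-aligned storage, so e.g.
// std::vector<__m128, AlignedAlloc<__m128>> is safe for aligned
// SSE loads/stores.  Only the members the standard containers need.
template <typename T, std::size_t Align = 16>
struct AlignedAlloc {
    using value_type = T;

    // Explicit rebind: required here because of the non-type
    // template parameter Align.
    template <typename U> struct rebind { using other = AlignedAlloc<U, Align>; };

    AlignedAlloc() = default;
    template <typename U> AlignedAlloc(const AlignedAlloc<U, Align>&) {}

    T* allocate(std::size_t n) {
        void* p = _mm_malloc(n * sizeof(T), Align);
        if (!p) throw std::bad_alloc();
        return static_cast<T*>(p);
    }
    void deallocate(T* p, std::size_t) { _mm_free(p); }
};

// All instances are interchangeable: any one can free another's memory.
template <typename T, typename U, std::size_t A>
bool operator==(const AlignedAlloc<T, A>&, const AlignedAlloc<U, A>&) { return true; }
template <typename T, typename U, std::size_t A>
bool operator!=(const AlignedAlloc<T, A>&, const AlignedAlloc<U, A>&) { return false; }
```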