Would this situation benefit from scalable_allocator?
The scalable allocator might help, but the only way to know for sure is to try it.
The way the scalable allocator works in this situation is that the block of memory will be allocated from the producer's heap. When the consumer frees it, the scalable allocator will see that the block belongs to the producer thread, and will send the block back to the producer's heap to be recycled there rather than freeing it into a shared pool.
The reason I didn't just try is that the thread profiling tools for Debian/Ubuntu are really poor. Having an understanding of how it would help was an important start for me.
[I'd love to buy VTune and Thread Checker but I've never been successful running them on Debian].
Is the nature of your producer-consumer such that you can't reuse the buffers when you're finished with them? If you set up another concurrent_queue to hold the free buffers processed by the consumer, the producer could reuse them. In general that would cost less than freeing and then reallocating them (presuming lots of things), which could shave some time when there are lots of buffers in use. Reuse the buffers if they're available, else allocate new ones.
The most recent release of Intel Thread Checker for Linux works on Fedora Core 6, according to the release notes. How long has it been since you last tried it? Did you pursue those failures with the product support groups (threading forum or Premier.intel.com)?
re #2: I've tried installing them myself with multiple versions over the products' lifetimes. There are also several support threads about Debian support (particularly regarding VTune). I never used premier.intel.com regarding those products since I was using an unsupported OS.
Here's an example of the 'best' solution I have found -- I haven't had an opportunity to try it though:
I was finally able to profile this properly using VTune on Fedora (and just copying all the dependent libs of my program to that system).
All my worry about memory allocation was premature optimization... most of the program time was spent in the consumer spinning on concurrent_queue::pop. To remedy this, I changed to pop_if_present and then wait on a boost::condition that is notified by the producer. This dramatically improved program performance -- from ~100% CPU load to ~10% CPU load.
There is another forum topic related to this concurrent_queue CPU spinning that is a pretty interesting read. I sorta felt that my final solution was sub-optimal, but it was very quick to implement.
I'm glad to hear of your success, both in sidestepping the Debian issue and finding a reasonable solution to your problems.
If you're referring to this forum thread, I'm intimately familiar with that conversation. As noted in the documentation and as rediscovered in your experiments, concurrent_queue is not well suited to holding locks for any significant duration because of the spin-lock implementation, but it works faster than other methods when queue access is active. The CPU utilization numbers you quote are typical.
And thanks for letting us know how it all turned out.