Fastest available Intel CPU for use with TBB?

git_g_ · ‎09-04-2013

I would like to choose the fastest available Intel CPU for my multi-threaded program, which makes heavy use of TBB.

I was thinking of maybe two 8-core CPUs? Xeon E5 or E7? Or maybe a 3rd- or 4th-generation i7?

What would you suggest?

jimdempseyatthecove · ‎09-05-2013

What is your budget? What is your problem size?

For 1 socket consider Intel Core i7-4960X or Core i7-4950HQ

(see reviews: http://www.tomshardware.com/reviews/core-i7-4960x-ivy-bridge-e-benchmark,3557.html and http://www.anandtech.com/show/7255/intel-core-i7-4960x-ivy-bridge-e-review)

Look at the multi-threaded charts.

If you are going multi-socket then consider the Xeon line

Jim Dempsey

git_g_ · ‎09-05-2013

I am working towards real-time scene understanding. People usually go for GPUs in my case, but today I am considering TBB and CPUs.

I am not certain if we should favor the number of cores (8/10 vs your suggested 6 and 4) or frequency, memory, etc.

I know that many parameters can influence the performance (if you know some important ones, please tell me), but I just want to check: do more cores mean better TBB performance?

jimdempseyatthecove · ‎09-05-2013

>>I am working towards real-time scene understanding.

The choice of frame grabber and frame size may factor more. It would be more efficient if the frame grabber does not involve a CPU thread in a driver to copy data from an internal buffer to application memory buffer. You would like this copy operation to be hardware driven. If it is not, then this will introduce overhead on one of the available hardware threads of the CPU.

The application thread communicating with the frame grabber driver should not be one of the TBB task pool threads. IFF (if and only if) the frame grabber driver performs a software copy to the buffer, then consider under-subscribing the TBB thread pool by 1 thread.
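As a rough illustration of the under-subscription rule above, here is a minimal Python sketch of the arithmetic only. The helper name `tbb_worker_count` and its parameters are hypothetical, not part of TBB. In a real TBB program the resulting number would be passed to the scheduler's thread limit (for example `tbb::task_scheduler_init` in classic TBB, or `tbb::global_control` in newer releases).

```python
import os


def tbb_worker_count(hardware_threads, driver_copies_in_software):
    """Suggest a worker-thread count (hypothetical helper, not a TBB API).

    One hardware thread is left free only when the frame grabber driver
    copies frames in software, as suggested in the post above.
    """
    if hardware_threads < 1:
        raise ValueError("hardware_threads must be at least 1")
    reserved = 1 if driver_copies_in_software else 0
    # Never drop below one worker, even on a single-thread machine.
    return max(1, hardware_threads - reserved)


if __name__ == "__main__":
    hw = os.cpu_count() or 1  # os.cpu_count() may return None
    print("hardware threads:", hw)
    print("suggested workers:", tbb_worker_count(hw, True))
```

Whether reserving exactly one thread is enough depends on how much work the driver copy really does, so treat the result as a starting point to be checked with a profiler.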

You haven't stated what you are using for a camera, or how many cameras. I assume you will initially test with one camera at 1920 x 1080 (color depth unknown). Each frame in this case will be on the order of 2-8 MB. You will want to consider reducing the number of bytes per pixel (when appropriate) to permit more frames in L3 cache. When you exceed L3, look for a processor line that has 4-channel memory access (as opposed to 2-channel).
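The 2-8 MB estimate follows directly from the frame dimensions. The sketch below works it out for 1, 2 and 4 bytes per pixel and counts how many whole frames would fit in L3. The 15 MB L3 size is an assumption (it matches the Core i7-4960X mentioned earlier), and the function names are illustrative.

```python
# Assumption: 15 MB of L3 cache, as on the Core i7-4960X.
L3_BYTES = 15 * 1024 * 1024


def frame_bytes(width, height, bytes_per_pixel):
    """Size of one uncompressed frame in bytes."""
    return width * height * bytes_per_pixel


def frames_in_cache(cache_bytes, width, height, bytes_per_pixel):
    """Whole frames that fit if the cache held nothing but frames."""
    return cache_bytes // frame_bytes(width, height, bytes_per_pixel)


if __name__ == "__main__":
    for bpp in (1, 2, 4):
        size = frame_bytes(1920, 1080, bpp)
        fit = frames_in_cache(L3_BYTES, 1920, 1080, bpp)
        print(f"{bpp} B/pixel: {size / 1e6:.1f} MB per frame, {fit} frame(s) in L3")
```

In practice the cache also holds code, intermediate results and other data, so the real number of resident frames will be lower than this upper bound.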

Scene understanding will likely require both intra-frame and inter-frame analysis. Your application design may require multiple (staggered) concurrent parallel pipelines:

Image reduction (e.g. 4 bytes per pixel to 1 or 2 bytes per pixel)
Intra-frame analysis
Inter-frame analysis A
Inter-frame analysis B
Results processing

The input stage of the image reduction receives frames from the frame grabber.
The input stage of the intra-frame analysis receives frames from the output of the image reduction.
The input stage of inter-frame analysis A receives adjacent frame pairs from the output of the image reduction.
The input stage of inter-frame analysis B receives adjacent frame pairs from the output of inter-frame analysis A.
The input stage of the results processing receives input from the output of inter-frame analysis A and B, and possibly the original un-reduced image from image reduction.
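The data flow between these stages can be sketched as follows. This is a sequential Python illustration of which stage feeds which, not a TBB pipeline: every function body is a placeholder invented for the example (byte subsampling, sums, element-wise differences), and in a real application the stages would run concurrently on different frames.

```python
def reduce_image(frame):
    """Image reduction placeholder: keep 1 byte out of every 4."""
    return frame[::4]


def intra_frame(reduced):
    """Intra-frame analysis placeholder: one number per frame."""
    return sum(reduced)


def difference(previous, current):
    """Inter-frame placeholder: element-wise difference of a pair."""
    return [c - p for p, c in zip(previous, current)]


def adjacent_pairs(items):
    """Yield (item[i], item[i + 1]) for every adjacent pair."""
    return zip(items, items[1:])


def run_pipeline(raw_frames):
    reduced = [reduce_image(f) for f in raw_frames]
    intra = [intra_frame(r) for r in reduced]
    # Analysis A consumes adjacent pairs of reduced frames.
    a = [difference(p, c) for p, c in adjacent_pairs(reduced)]
    # Analysis B consumes adjacent pairs of A's outputs.
    b = [difference(p, c) for p, c in adjacent_pairs(a)]
    # Results processing would combine intra, a and b here.
    return intra, a, b


if __name__ == "__main__":
    frames = [bytes([v] * 8) for v in (0, 1, 3, 6)]
    intra, a, b = run_pipeline(frames)
    print(intra, a, b)
```

Note the staggering: N input frames give N-1 results from analysis A and N-2 from analysis B, so the results stage has to line up outputs that belong to different frame indices.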

I do not know how best to assign an optimal number of threads to each pipeline. You may not know until you profile the application with VTune. I imagine much of your processing will be integer and will therefore take advantage of the HT capabilities.

Have you implemented your software on your current hardware platform?
If so, then have you performed a VTune analysis?

This may be instructive in selecting a better platform.

Jim Dempsey

jiri · ‎09-07-2013

One important question that you have to ask yourself is: "How well does my program scale?" In other words, will doubling the number of cores increase the performance by 30% or 95%? In the first case, you probably want a CPU where each core is as fast as possible, even if that means having, e.g., 6 cores instead of 8. In the second case, you almost certainly want as many cores as possible, since adding extra cores tends to be "cheaper" than improving single-core performance. You may even consider the Xeon Phi coprocessor, since you may be interested in the powerful vector units on that chip.
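One way to turn the 30% versus 95% question into numbers is Amdahl's law, which is swapped in here as a simple model and is not something the post itself names. The sketch assumes the doubling was measured going from 1 to 2 cores, and it ignores memory bandwidth and cache effects, so it is only a rough guide. The function names are illustrative.

```python
def parallel_fraction(speedup_when_doubling):
    """Parallel fraction p implied by the speedup from 1 to 2 cores.

    Amdahl's law: S = 1 / ((1 - p) + p / 2), hence p = 2 * (1 - 1 / S).
    """
    return 2.0 * (1.0 - 1.0 / speedup_when_doubling)


def amdahl_speedup(p, cores):
    """Predicted speedup over one core for parallel fraction p."""
    return 1.0 / ((1.0 - p) + p / cores)


if __name__ == "__main__":
    for gain in (0.30, 0.95):
        p = parallel_fraction(1.0 + gain)
        s6 = amdahl_speedup(p, 6)
        s8 = amdahl_speedup(p, 8)
        print(f"+{gain:.0%} on doubling: p={p:.3f}, "
              f"6 cores x{s6:.2f}, 8 cores x{s8:.2f}")
```

Under this model the poorly scaling program gains only a few percent going from 6 to 8 cores, so a higher clock is the better buy, while the well-scaling program gains roughly a quarter.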

A separate question is whether you are more interested in pipeline parallelism or something else, such as loop-level parallelism. Image processing is usually associated with the former, but since you mentioned real-time scene understanding, you may want the evaluation of each frame to finish before the next frame arrives, in which case a pipeline isn't what you are looking for.
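The distinction here is between throughput and latency. A small sketch, using made-up stage times and an assumed 30 fps camera, shows how a pipeline can keep up with the frame rate and still deliver each result later than one frame period. All names and numbers are hypothetical.

```python
def frame_budget_ms(fps):
    """Time between consecutive frames, in milliseconds."""
    return 1000.0 / fps


def pipeline_latency_ms(stage_ms):
    """Time for one frame to pass through every stage in turn."""
    return sum(stage_ms)


def pipeline_interval_ms(stage_ms):
    """Time between results when all stages overlap on different frames.

    The slowest stage sets the pace.
    """
    return max(stage_ms)


if __name__ == "__main__":
    stages = [12.0, 20.0, 15.0]  # hypothetical per-stage times in ms
    budget = frame_budget_ms(30)  # assumed 30 fps camera
    print(f"budget per frame: {budget:.1f} ms")
    print("keeps up with camera:", pipeline_interval_ms(stages) <= budget)
    print("done before next frame:", pipeline_latency_ms(stages) <= budget)
```

With these numbers the pipeline sustains the frame rate, but each result arrives after the next frame has already been captured. If the result must be ready within one frame period, the per-frame work itself has to be shortened, for example with loop-level parallelism inside each stage.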

Also be sure not to miss the advice about caches in the previous post, since the difference between having the whole frame in the cache and having just half of it in the cache can be really huge (e.g., performance that is several times better for the all-in-cache case).