I am testing on a dual, quad-core Xeon 5506 system (64-bit MKL 10.2, Windows XP64). Profiling the code in sequential mode shows that a very significant amount of time is spent inside MKL in *gemm3, so I was expecting some decent scaling as I increased OMP_NUM_THREADS from 1 to 8:
| Threads | Elapsed time (s) | Process time (s) |
|---------|------------------|------------------|
| 1 | 820.577 | 819.703 |
| 2 | 596.265 | 1035.69 |
| 3 | 527.077 | 1350.08 |
| 4 | 491.907 | 1640.97 |
| 5 | 475.305 | 1856.08 |
| 6 | 460.596 | 2097.59 |
| 7 | 454.632 | 2312.84 |
| 8 | 449.244 | 2623.67 |
I just don't really understand this. It appears MKL is keeping all 8 cores 'busy', but they must be just spinning their wheels. Are there any Intel tools that will help me figure out what is going on here?
Andrew
7 Replies
If you link with libiompprof5 (/Qopenmp_profile when using ICL or IFORT), a performance summary of each threaded region is written to guide.gvs, showing the balance of work and barrier time among the threads. Compare runs under various settings of the KMP_AFFINITY environment variable, e.g. SET KMP_AFFINITY=compact,0,verbose, and check the echoed output to confirm that the core numbering has been understood; a hypothetical build-and-run sequence is sketched below.
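For concreteness, a build-and-run sequence along these lines might look like the following (the source file name and the MKL library names are placeholder assumptions; adjust them to your actual link line for MKL 10.2):

```bat
rem Hypothetical sketch: link with the OpenMP profiling runtime;
rem guide.gvs is written when the program exits.
ifort /Qopenmp_profile mytest.f90 mkl_intel_lp64.lib mkl_intel_thread.lib mkl_core.lib

rem Pin the threads and echo the mapping so core numbering can be checked.
set OMP_NUM_THREADS=4
set KMP_AFFINITY=compact,0,verbose
mytest.exe
```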
For 2, 4, and 6 threads, also try splitting the threads 50-50 between the two processors, but not alternating.
For a GUI display, guide.gvs can be imported into VTune, or Thread Profiler can be used.
VTune or PTU event sampling should give you more detail, assuming there is a cache-capacity limitation. The reduced cache size and memory-bus capacity of the 5506, compared with full-featured models, may become a handicap at some problem size in ?gemm. It would then be interesting to compare results when you compile your ?gemm from public source with ifort, with debug symbols enabled.
Quoting - vasci_intel
It appears MKL is keeping all 8 cores 'busy', but they must be just spinning their wheels. Are there any Intel tools that will help me figure out what is going on here?
Regarding scaling, you may try varying the KMP_BLOCKTIME variable: increasing it makes the OpenMP threads react faster at the start of the next parallel region, while a smaller block time can improve performance for non-OpenMP threaded code running between the regions.
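For reference, a minimal sketch of the programmatic route, assuming the Intel OpenMP runtime extension kmp_set_blocktime() declared in Intel's omp.h (equivalent to setting KMP_BLOCKTIME in the environment):

```c
/* Minimal sketch: adjust the OpenMP spin-wait ("block") time in code.
   kmp_set_blocktime() is an Intel OpenMP runtime extension from omp.h. */
#include <omp.h>

void tune_blocktime(void)
{
    /* The default is 200 ms: after a parallel region each worker spins
       that long before sleeping.  A larger value lets threads re-enter
       the next MKL parallel region faster; a smaller one frees the cores
       sooner for non-OpenMP work between regions. */
    kmp_set_blocktime(200);   /* milliseconds; vary this and re-measure */
}
```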
But in many cases on FSB-based systems, the application's bandwidth appetite causes non-scaling. Is your Xeon 5506 FSB-based?
PTU is able to pinpoint the bandwidth problem on FSB-based systems (with the Core2 Bandwidth profile configuration), as well as show the run-time balance between threads and other microarchitectural issues.
On the other hand, I doubt that gemm itself is causing non-scaling so rapidly (at 3-4 threads), as it does the multiplication by blocks.
BTW, what are the matrix sizes in your test?
--Gennady
The Xeon 5506 is a Nehalem-architecture part with reduced QPI performance, half the normal cache size, and no HyperThreading or Turbo mode. If the MKL blocking is effective for the smaller cache, and KMP_AFFINITY=compact is set, I would think that could compensate for the slower QPI. Does MKL assume that all Nehalem platforms have the larger cache?
I spent some time trying to understand Gennady's comment about KMP_BLOCKTIME. I think he means that small matrices, with relatively short parallel execution times and significant serial execution times in between, could benefit from a reduced KMP_BLOCKTIME. But I suppose the matrix has to be fairly large to use 8 threads, as your timings indicate may be happening. I assume the OP would tell us if OMP_NESTED were in use.
As Gennady said, it's not possible to make relevant comments without knowing whether the question is about many small matrix operations or large enough ones that cache size becomes an issue.
Quoting - tim18
As Gennady said, it's not possible to make relevant comments without knowing whether the question is about many small matrix operations or large enough ones that cache size becomes an issue.
Quoting - vasci_intel
The core matrix multiplications in this case are on complex matrices, 394x768 * 768x768. My next step will be to extract this core operation into a simple test case I can share that calls zgemm3, see how it scales compared with when it is buried in my application code, and report back. (A sketch of such a harness follows the table below.)
The table gives, for each thread count, the speedup relative to one thread (wall-clock time with 1 thread divided by wall-clock time with the given thread count); N is the dimension of a square DComplex matrix.
| Threads | Speedup, N=16 | Speedup, N=256 | Speedup, N=1024 |
|---------|---------------|----------------|-----------------|
| 1 | 1 | 1 | 1 |
| 2 | 1.382634633 | 1.921579502 | 1.948073886 |
| 3 | 1.725415255 | 2.478885332 | 2.790286666 |
| 4 | 2.661113465 | 2.819896679 | 3.537633931 |
| 5 | 2.048479939 | 3.198962239 | 4.244896508 |
| 6 | 1.620603122 | 3.154567957 | 4.76614606 |
| 7 | 1.791383007 | 3.364675664 | 5.264932759 |
| 8 | 1.979928082 | 2.935450255 | 5.429620953 |
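A minimal standalone harness along these lines, sketched with MKL's CBLAS interface (cblas_zgemm) standing in for the *gemm3 kernel — the file name, build line, and parameter choices here are illustrative assumptions, not the poster's actual code — might look like:

```c
/* Sketch of a zgemm scaling test (assumes MKL LP64 and an OpenMP build,
   e.g.: icl /Qopenmp zgemm_scale.c mkl_intel_lp64.lib mkl_intel_thread.lib mkl_core.lib).
   Run repeatedly with different OMP_NUM_THREADS settings. */
#include <stdio.h>
#include <stdlib.h>
#include <mkl.h>     /* cblas_zgemm, MKL_Complex16, mkl_get_max_threads */
#include <omp.h>     /* omp_get_wtime */

int main(int argc, char **argv)
{
    int n = (argc > 1) ? atoi(argv[1]) : 1024;   /* square matrix dimension */
    int reps = 10;                               /* timing repetitions */
    MKL_Complex16 alpha = {1.0, 0.0}, beta = {0.0, 0.0};

    MKL_Complex16 *a = malloc(sizeof(*a) * (size_t)n * n);
    MKL_Complex16 *b = malloc(sizeof(*b) * (size_t)n * n);
    MKL_Complex16 *c = malloc(sizeof(*c) * (size_t)n * n);
    for (size_t i = 0; i < (size_t)n * n; ++i) {
        a[i].real = 1.0; a[i].imag = 0.5;
        b[i].real = 0.5; b[i].imag = 1.0;
        c[i].real = 0.0; c[i].imag = 0.0;
    }

    /* Warm-up call so thread-pool creation is not counted in the timing. */
    cblas_zgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, &alpha, a, n, b, n, &beta, c, n);

    double t0 = omp_get_wtime();
    for (int r = 0; r < reps; ++r)
        cblas_zgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, &alpha, a, n, b, n, &beta, c, n);
    double t = (omp_get_wtime() - t0) / reps;

    printf("N=%d threads=%d time=%.4f s\n", n, mkl_get_max_threads(), t);

    free(a); free(b); free(c);
    return 0;
}
```

Running it with OMP_NUM_THREADS set to 1 through 8 for N = 16, 256, and 1024 should reproduce the shape of the speedup table above.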
Quoting - tim18
For a GUI display, guide.gvs can be imported into VTune, or Thread Profiler can be used.
Hi Tim,
This does bring up the problem Intel has with an annoying and confusing proliferation of threading-related products(*). The annoying part is that I am paying an "arm and a leg" for C++ Professional and Fortran Professional, yet I do not get access to VTune or Thread Profiler! Yet I do get IPP, which is probably less useful to most people.
(*) Partial list:
- Compilers with OpenMP
- Intel TBB
- VTune
- Parallel Studio
- Thread Profiler
- Intel IPP
- Intel MKL
- Intel Parallel Amplifier?
- ...
Andrew
In my experience, the useful part of Thread Profiler for OpenMP is what you get with the openmp-profile library, which comes with the compilers. I had another go this morning at the Windows VTune Thread Profiler with no luck.
TBB and MKL also come with the C++ compilers (the same MKL also comes with Fortran).
Parallel Studio is a package including a slightly simplified C++ with OpenMP, plus TBB, IPP, and Amplifier (a simplified VTune). If it met your needs, you probably wouldn't buy the others.
If you didn't need all the capabilities of VTune, or were not developing for current Intel CPUs, you would probably use gprof or oprofile/CodeAnalyst.
