Parallel Runge-Kutta on Intel MIC in offload mode with 120 threads is slower than 16 threads on CPU

Cui__zekun · 03-28-2018

Hi,

My program, a parallel Runge-Kutta method, is slower on the MIC than on the CPU.

I parallelized the program on the CPU with OpenMP using 16 threads, and it takes 0.3642162 s.

But when I run the program on the MIC in offload mode with 120 threads, the computation time rises to 0.699159 s and the total time to 7.05742 s (including data transfer, memory allocation, and computation).

(I use one MIC card with 60 cores and 240 threads.)

Could someone please help me understand why even the pure computation time is that long?

Here is my code:

x[0] = xstart;
y[0] = yn;

double start = omp_get_wtime();
for (int j = 1; j < threa_num; j++)
{
    int index = j * (x_m / threa_num);
    x[index] = xstart + index * m_h;
    y[index] = sin(x[index]);

}
#pragma offload target(mic:1) inout(x,y:length(x_m))
{
    t = 2;
    omp_set_num_threads(t);

    start1 = omp_get_wtime();
    #pragma omp parallel for private(k1,k2,k3,k4) firstprivate(flag)

    for (i = 0; i < x_m; i++)
    {

        x[i + 1] = x[i] + m_h;
        k1 = func(x[i], y[i]);
        k2 = func(x[i] + m_h / 2, y[i] + (m_h / 2) * k1);
        k3 = func(x[i] + m_h / 2, y[i] + (m_h / 2) * k2);
        k4 = func(x[i] + m_h, y[i] + m_h * k3);
        y[i + 1] = y[i] + ((k1 + 2 * k2 + 2 * k3 + k4) / 6) * m_h;

    }

    end1 = omp_get_wtime();


}
double end = omp_get_wtime();

TimP · 03-29-2018

1) With the race condition in the threaded loop, you are lucky if any threading at all gives a speedup and a sufficiently correct result. You certainly should not expect such a large increase in the number of threads to help. Even with correct threading, you must use a mechanism such as KMP_HW_SUBSET to spread your threads evenly across cores in order to see reasonable scaling.

2) Among the many requirements for offload mode to be useful is that the number of calculations be of a higher order than the amount of data transferred. Not just a factor of 4 or so, more like a factor of 4000.

3) As you show neither func() nor a compiler optimization report (and you report reasonable threaded speedup on a host CPU), we must assume func() is not inlined and vectorized. That would be one of the prerequisites for MIC (at least the KNC version) to achieve reasonable performance. Even in comparison with the CPUs available when MIC KNC was launched, it would not be unusual to require a good combination of inner loop vectorization and outer loop parallelism (and an ideal problem size) to see MIC achieve double the performance of host. Yours may be one of those cases where applying the optimizations necessary for MIC to be useful will also improve host performance significantly.

jimdempseyatthecove · 03-29-2018

Line 15 is setting the number of threads (in the offload region) to 2.

It is not clear from the code listed above if the code starting at line 1 is executed in a parallel region on host (with 16 threads on host), or in a serial region (1 thread on host).

If the above code starts in a serial region on host, then the number of threads used in offload is 2. (2 threads on MIC will be slower than 16 threads on host)

If the above code starts in a parallel region on host, presumably with 16 threads, then the offload region will start (16 times), and on first offload (per host thread) will instantiate 16 OpenMP thread pools (one pool per host thread). Presumably resulting in 32 threads on MIC. (32 threads on MIC, especially scalar computations, are likely slower than 16 threads on host).
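
One way to settle how many threads really run is to ask the OpenMP runtime from inside a parallel region. The helper below is a minimal sketch; threads_in_parallel_region is a hypothetical name, and the sketch assumes it is called from the same context as the loop being timed (with the Intel offload compiler, a function called inside the offload region would also have to be marked for the coprocessor).

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Returns the number of threads that actually execute a parallel region.
   Falls back to 1 when the file is compiled without OpenMP support. */
static int threads_in_parallel_region(void)
{
    int n = 1;
#ifdef _OPENMP
    #pragma omp parallel
    {
        /* Exactly one thread records the team size. */
        #pragma omp single
        n = omp_get_num_threads();
    }
#endif
    return n;
}
```

Printing this value right after omp_set_num_threads(t) would show whether the region runs with 2 threads, 120, or something else.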

*** note 1, the first time an offload is made to MIC, you have the overhead of copying in the program image.
*** note 2, the first time an offload region, as entered by thread n on host, enters an OpenMP parallel region, that instance/context instantiates an OpenMP thread pool (one pool per host associated thread).

Unless your application behaves like your test code (only going to execute once) you should execute your test code more than once (in a single run) and use the timing results from the first iteration as well as second (or more) iterations. The time difference between the first run and second run (or average of later runs) will be the once-only overhead time.
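
The repeated-run measurement could look roughly like the sketch below. It is plain C with no offload pragmas; run_times, time_repeated, and sample_work are hypothetical names, and sample_work merely stands in for the offloaded region.

```c
#include <time.h>

/* Wall-clock seconds via C11 timespec_get. */
static double now_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

typedef struct {
    double first;      /* first run: includes the once-only overhead */
    double later_avg;  /* average of runs 2..reps */
} run_times;

/* Runs work() reps times (reps >= 2) and keeps the first run
   separate from the average of the remaining runs. */
static run_times time_repeated(void (*work)(void), int reps)
{
    run_times r = {0.0, 0.0};
    for (int i = 0; i < reps; i++) {
        double t0 = now_seconds();
        work();
        double dt = now_seconds() - t0;
        if (i == 0)
            r.first = dt;
        else
            r.later_avg += dt / (reps - 1);
    }
    return r;
}

/* Placeholder workload standing in for the offload region. */
static int calls = 0;
static volatile double sink = 0.0;
static void sample_work(void)
{
    calls++;
    for (int i = 0; i < 100000; i++)
        sink += i * 1e-9;
}
```

With r = time_repeated(sample_work, 5), the difference r.first - r.later_avg estimates the once-only overhead described above.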

Jim Dempsey

jimdempseyatthecove · 03-29-2018

Additional note: if multiple threads are used on host, .AND. multiple OpenMP thread pools are created in MIC, .AND. MIC is pinning threads (KMP_AFFINITY=....), then each OpenMP thread pool in MIC will be pinned to the same logical processors.

Jim Dempsey

Cui__zekun · 03-30-2018

Hi, Tim

I will try KMP_HW_SUBSET to spread my threads.

I measured the data transfer time; it is about 5.3 s, which is much larger than the computation time. So maybe I should give up on offload mode.

My func() is just a simple cos(x), and it is not vectorized. Because of the data dependence, I can't vectorize my "for" loop. So is it true that a non-vectorized program will always perform worse on MIC than on CPU?

Cui__zekun · 03-30-2018

Hi, Jim

It was my mistake to paste the code with some errors. When I tested my code, I did set the variable "t" to 120.

And line 1 is executed in a serial region (1 thread on host).

I can't vectorize my "for" loop, so I wonder whether a non-vectorized program will always perform worse on MIC than on CPU?