Intel® Fortran Compiler

IFX OpenMP GPU SIMD Not Working?

Theo_M
Novice

Hi everyone,

For my PhD, I'm moving from GFortran offloading to an NVIDIA GPU to IFX offloading to an Intel Arc A770 on Kubuntu 24.04.

I think I've managed to work through the nightmare installation process of dependencies (but not Level Zero, etc.), but I'm not seeing the performance I expect. In particular, I'm using the code at the end to profile the CPU and GPU with OpenMP, compiling with:

 

ifx -xhost -qopenmp -fopenmp-targets=spir64 main.f90

 

Which gives the following performance figures:

 

CPU initialised with 16 threads
GPU initialised with  512 teams, totalling 524288 threads

CPU Parallel (no SIMD):  1x16 threads
time                = 5395.2 ms
rate                =   12.4 GFLOPS
bandwidth           =    6.2 MB/s
 
CPU SIMD Parallel:  1x16 threads
time                =  674.8 ms
rate                =   99.4 GFLOPS
bandwidth           =   49.7 MB/s
 
GPU Parallel (no SIMD):
time                =  366.8 ms
total_time          =  376.7 ms
rate                =  182.9 GFLOPS
total rate          =  178.1 GFLOPS
bandwidth           =   89.1 MB/s
on device bandwidth =   91.5 MB/s
 
GPU SIMD Parallel:
time                =  367.6 ms
total_time          =  377.2 ms
rate                =  182.5 GFLOPS
total rate          =  177.9 GFLOPS
bandwidth           =   89.0 MB/s
on device bandwidth =   91.3 MB/s

 

My issue is that there is no difference between the "GPU Parallel (no SIMD)" and "GPU SIMD Parallel" performance despite the difference in the OpenMP directives ( !$omp target teams distribute parallel do private(si,j) vs !$omp target teams distribute parallel do simd private(si,j) ).

Also, both are far below the expected performance of the GPU. I can understand the non-SIMD code being slow, but my old 2014 Nvidia GTX 970 with GFortran was running the SIMD code in 21 ms rather than 367 ms!

What am I missing? Any help or thoughts appreciated before I lose my mind!

Thanks,

Theo




Code used for profiling:

 

program MAIN
    use OMP_LIB
    implicit none


    call initialise_cpu()
    call initialise_gpu()

    call run_program()

contains

    subroutine initialise_cpu()
        integer :: num_threads

        num_threads = omp_get_max_threads()

        write(*,"(A, I2, A)") "CPU initialised with ", num_threads, " threads"

    end subroutine

    subroutine initialise_gpu()
        integer :: num_teams
        integer :: num_threads

        num_teams = 0
        num_threads = 0

        !$omp target teams map(tofrom:num_teams,num_threads)
            num_teams = omp_get_num_teams()
            !$omp parallel
                !$omp atomic
                    num_threads = num_threads + 1
                !$omp end atomic
            !$omp end parallel
        !$omp end target teams

        write(*,"(A, I4, A, I6, A)") "GPU initialised with ", num_teams, " teams, totalling ", num_threads, " threads"

    end subroutine


    subroutine run_program()
        ! integer, dimension(160 * 64 * 16) :: inputs

        real(8) :: t01, t02, t1, t2
        real(8) :: T_c, T_g, epsilon, T_h

        integer :: i, j, si

        integer :: num_teams, num_threads

        real, dimension(:), allocatable :: dummy_real_array
        integer, parameter :: NumRepetitions = 2000

        integer :: division_point, size_dummy_real_array


        allocate(dummy_real_array(256 * 1024 * 32))


        t01 = omp_get_wtime()

        num_teams = 1
        num_threads = omp_get_max_threads()

        !$omp teams distribute parallel do private(si,j) shared(dummy_real_array) num_teams(num_teams) num_threads(num_threads)
        do si = 1, size(dummy_real_array)
            dummy_real_array(si) = real(si)
            do j = 1, NumRepetitions
                dummy_real_array(si) = dummy_real_array(si)**3
                dummy_real_array(si) = dummy_real_array(si) - (317.0 * int(dummy_real_array(si) / 317.0))
            end do
        end do

        t02 = omp_get_wtime()
        
        T_c = t02 - t01

        print "(A, I2, A, I2, A)",   "CPU Parallel (no SIMD): ", num_teams , "x", num_threads, " threads"
        print "(A, F6.1, A)", "time                = ", T_c * 1e3, " ms"
        print "(A, F6.1, A)", "rate                = ", (size(dummy_real_array, kind =  * NumRepetitions * (4_8) / (t02 - t01)) / (1e9), " GFLOPS"
        print "(A, F6.1, A)", "bandwidth           = ", (sizeof(dummy_real_array) / (t02 - t01)) / (1e6),   " MB/s"
        print *, ""



        t01 = omp_get_wtime()

        num_teams = 1
        num_threads = omp_get_max_threads()

        !$omp teams distribute parallel do simd private(si,j) shared(dummy_real_array) num_teams(num_teams) num_threads(num_threads)
        do si = 1, size(dummy_real_array)
            dummy_real_array(si) = real(si)
            do j = 1, NumRepetitions
                dummy_real_array(si) = dummy_real_array(si)**3
                dummy_real_array(si) = dummy_real_array(si) - (317.0 * int(dummy_real_array(si) / 317.0))
            end do
        end do

        t02 = omp_get_wtime()
        
        T_c = t02 - t01

        print "(A, I2, A, I2, A)",   "CPU SIMD Parallel: ", num_teams , "x", num_threads, " threads"
        print "(A, F6.1, A)", "time                = ", T_c * 1e3, " ms"
        print "(A, F6.1, A)", "rate                = ", (size(dummy_real_array, kind =  * NumRepetitions * (4_8) / (t02 - t01)) / (1e9), " GFLOPS"
        print "(A, F6.1, A)", "bandwidth           = ", (sizeof(dummy_real_array) / (t02 - t01)) / (1e6),   " MB/s"
        print *, ""



        
        t1 = omp_get_wtime()

        !$omp target enter data map(alloc:dummy_real_array)

        t01 = omp_get_wtime()

        !$omp target teams distribute parallel do private(si,j)
        do si = 1, size(dummy_real_array)

            dummy_real_array(si) = real(si)
            do j = 1, NumRepetitions
                dummy_real_array(si) = dummy_real_array(si)**3
                dummy_real_array(si) = dummy_real_array(si) - (317.0 * int(dummy_real_array(si) / 317.0))
            end do

        end do

        t02 = omp_get_wtime()

        !$omp target exit data map(from:dummy_real_array)
        !$omp target exit data map(release:dummy_real_array)

        t2 = omp_get_wtime()

        print "(A)", "GPU Parallel (no SIMD):"
        print "(A, F6.1, A)", "time                = ", (t02 - t01) * 1e3, " ms"
        print "(A, F6.1, A)", "total_time          = ", (t2 - t1) * 1e3, " ms"
        print "(A, F6.1, A)", "rate                = ", (size(dummy_real_array, kind =  * NumRepetitions * (4_8) / (t02 - t01)) / (1e9), " GFLOPS"
        print "(A, F6.1, A)", "total rate          = ", (size(dummy_real_array, kind =  * NumRepetitions * (4_8) / (t2 - t1)) / (1e9), " GFLOPS"
        print "(A, F6.1, A)", "bandwidth           = ", (sizeof(dummy_real_array) / (t2 - t1)) / (1e6),   " MB/s"
        print "(A, F6.1, A)", "on device bandwidth = ", (sizeof(dummy_real_array) / (t02 - t01)) / (1e6), " MB/s"
        print *, ""





        t1 = omp_get_wtime()

        !$omp target enter data map(alloc:dummy_real_array)

        t01 = omp_get_wtime()

        !$omp target teams distribute parallel do simd private(si,j)
        do si = 1, size(dummy_real_array)

            dummy_real_array(si) = real(si)
            do j = 1, NumRepetitions
                dummy_real_array(si) = dummy_real_array(si)**3
                dummy_real_array(si) = dummy_real_array(si) - (317.0 * int(dummy_real_array(si) / 317.0))
            end do

        end do

        t02 = omp_get_wtime()

        !$omp target exit data map(from:dummy_real_array)
        !$omp target exit data map(release:dummy_real_array)

        t2 = omp_get_wtime()

        T_g = t2 - t1

        print "(A)", "GPU SIMD Parallel:"
        print "(A, F6.1, A)", "time                = ", (t02 - t01) * 1e3, " ms"
        print "(A, F6.1, A)", "total_time          = ", T_g * 1e3, " ms"
        print "(A, F6.1, A)", "rate                = ", (size(dummy_real_array, kind =  * NumRepetitions * (4_8) / (t02 - t01)) / (1e9), " GFLOPS"
        print "(A, F6.1, A)", "total rate          = ", (size(dummy_real_array, kind =  * NumRepetitions * (4_8) / (t2 - t1)) / (1e9), " GFLOPS"
        print "(A, F6.1, A)", "bandwidth           = ", (sizeof(dummy_real_array) / (t2 - t1)) / (1e6),   " MB/s"
        print "(A, F6.1, A)", "on device bandwidth = ", (sizeof(dummy_real_array) / (t02 - t01)) / (1e6), " MB/s"
        print *, ""

    end subroutine
    
end program MAIN

 

 
Additionally, with export LIBOMPTARGET_PLUGIN_PROFILE=T I get the following output:

 

================================================================================
LIBOMPTARGET_PLUGIN_PROFILE(OPENCL) for OMP DEVICE(0) Intel(R) Arc(TM) A770 Graphics, Thread 0
--------------------------------------------------------------------------------
-- Kernel 0                  : __omp_offloading_802_40249a_main_IP_initialise_gpu__l29
-- Kernel 1                  : __omp_offloading_802_40249a_main_IP_run_program__l121
-- Kernel 2                  : __omp_offloading_802_40249a_main_IP_run_program__l158
--------------------------------------------------------------------------------
-- Name                      :     Host Time (msec)   Device Time (msec)
-- Compiling                 :                0.357                0.000
-- DataAlloc                 :                0.393                0.000
-- DataRead (Device to Host) :                7.299                7.063
-- DataWrite (Host to Device):                0.147                0.024
-- Kernel 0                  :                1.550                0.043
-- Kernel 1                  :              366.670              366.194
-- Kernel 2                  :              367.493              366.803
-- Linking                   :              382.306                0.000
-- Total                     :             1126.215              740.127
================================================================================

 

 


7 Replies
jimdempseyatthecove
Honored Contributor III

The structure of your loop presents a challenging optimization issue:

        !$omp teams distribute parallel do simd private(si,j) shared(dummy_real_array) num_teams(num_teams) num_threads(num_threads)
        do si = 1, size(dummy_real_array)
            dummy_real_array(si) = real(si)
            do j = 1, NumRepetitions
                dummy_real_array(si) = dummy_real_array(si)**3
                dummy_real_array(si) = dummy_real_array(si) - (317.0 * int(dummy_real_array(si) / 317.0))
            end do
        end do

The interior do j loop causes the optimizer to produce non-SIMD (scalar) code, whereas other Fortran compilers can produce SIMD code here. Your code is an example where the Intel team can make some improvements.

 

If this example code is representative of production code (as opposed to benchmark code), then you will need to hand-partition the si iteration space per team: IOW, use !$omp teams distribute without the parallel, split the si iteration space into private iBegin/iEnd slice variables per team, and then run an internal !$omp parallel do simd on each team's slice, along the lines of the sketch below.
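A rough, untested sketch of what I mean (it assumes the same dummy_real_array, NumRepetitions and device mapping as in your program; NTEAMS would be an integer parameter, say 512, and team, chunk, iBegin, iEnd additional integer locals):

        !$omp target teams distribute num_teams(NTEAMS) private(chunk, iBegin, iEnd)
        do team = 0, NTEAMS - 1
            ! each team takes one contiguous slice of the si range
            chunk  = (size(dummy_real_array) + NTEAMS - 1) / NTEAMS
            iBegin = team * chunk + 1
            iEnd   = min((team + 1) * chunk, size(dummy_real_array))

            ! ...and parallelises/vectorises only within its own slice
            !$omp parallel do simd private(j)
            do si = iBegin, iEnd
                dummy_real_array(si) = real(si)
                do j = 1, NumRepetitions
                    dummy_real_array(si) = dummy_real_array(si)**3
                    dummy_real_array(si) = dummy_real_array(si) - (317.0 * int(dummy_real_array(si) / 317.0))
                end do
            end do
        end do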

 

Jim Dempsey

Theo_M
Novice

Hi Jim,

Thanks for your reply - it's really appreciated!

As a quick test, I removed the inner "do j" loop and replaced it with twenty consecutive copies of:

 

 

dummy_real_array(si) = dummy_real_array(si)**3
dummy_real_array(si) = dummy_real_array(si) - (317.0 * int(dummy_real_array(si) / 317.0))

 

 

This has dramatically sped up the parallel CPU, parallel GPU, and SIMD parallel GPU figures, as expected for reducing the workload by a factor of 100! Significantly, the rate (GFLOPS) has reduced; see the latest results below.

I note that the GPU parallel (no SIMD) and GPU SIMD parallel versions are still similar in performance. This, and the reduced throughput, may be because the test is now too lightweight; I'm unsure. Could it be that the non-SIMD version is being vectorised automatically by the compiler (or the other way around)?

I will do some further tests and feed back.

Many thanks,

Theo

PS: After a painful install of IGC, I have also got Level Zero working!


Latest results:

 

CPU initialised with 16 threads
GPU initialised with  512 teams, totalling 524288 threads

CPU Parallel (no SIMD):  1x16 threads
time                =         11.6 ms
rate                =         57.7 GFLOPS
bandwidth           =       2884.7 MB/s
 
CPU SIMD Parallel:  1x16 threads
time                =          9.9 ms
rate                =         67.6 GFLOPS
bandwidth           =       3381.8 MB/s
 
GPU Parallel (no SIMD):
time                =          4.3 ms
total_time          =         11.3 ms
rate                =        156.0 GFLOPS
total rate          =         59.2 GFLOPS
bandwidth           =       2961.5 MB/s
on device bandwidth =       7798.0 MB/s
 
GPU SIMD Parallel:
time                =          5.8 ms
total_time          =         12.1 ms
rate                =        115.8 GFLOPS
total rate          =         55.4 GFLOPS
bandwidth           =       2768.7 MB/s
on device bandwidth =       5791.2 MB/s

 

 


 

Theo_M
Novice

As promised, I have looked a bit further into this.

Following Jim's recommendation above, I moved the "do j" loop into a pure elemental subroutine (I think it only needed pure, TBH), roughly as sketched below, and it is now crazy fast - Thanks, Jim!
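For reference, the refactor looks roughly like this. It's an illustrative, cut-down sketch rather than my production code: repeat_kernel and nrep are names made up for this post, and depending on where the routine lives a !$omp declare target may also be needed.

program refactor_sketch
    implicit none
    integer, parameter :: NumRepetitions = 2000
    real, dimension(:), allocatable :: dummy_real_array
    integer :: si

    allocate(dummy_real_array(256 * 1024 * 32))

    ! The offload loop body is now just one call per element.
    !$omp target teams distribute parallel do private(si) map(tofrom: dummy_real_array)
    do si = 1, size(dummy_real_array)
        dummy_real_array(si) = real(si)
        call repeat_kernel(dummy_real_array(si), NumRepetitions)
    end do

    print *, dummy_real_array(1)

contains

    ! The old inner "do j" loop, hoisted out of the offload region.
    pure elemental subroutine repeat_kernel(x, nrep)
        real, intent(inout) :: x
        integer, intent(in) :: nrep
        integer :: j

        do j = 1, nrep
            x = x**3
            x = x - (317.0 * int(x / 317.0))
        end do
    end subroutine

end program refactor_sketch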

What I am finding, however, is that irrespective of "simd" as an OpenMP directive and irrespective of the "-qno-openmp-simd" compiler flag, I get the same performance.

Does anyone know if IFX respects this directive and/or flag? What I don't know at the moment is whether SIMD is active or not, or even whether the Arc A770/IFX thinks in these terms at all.

I'd appreciate any info on determining whether SIMD is active and/or how to control it. Things I've already tried include "-qno-openmp-simd" and inspecting the "-qopt-report" output, which doesn't show SIMD optimisations applied to the GPU (it does for the CPU).

Finally, does anyone know why the "-fast" flag produces this error?

ld: cannot find -lomptarget: No such file or directory
ld: /home/tjm/intel/oneapi/compiler/2024.2/lib/libifcoremt.a(for_close_proc.o): in function `for__close_proc':
for_close_proc.c:(.text+0x1df): warning: Using 'dlopen' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking

 
Thanks,

Theo

Shiquan_Su
Moderator

Hi, Theo:

Currently, the GPU device compiler is already smart enough to use most of the hardware's computing power with the general omp construct "!$omp target teams distribute parallel do". In this case, the compiler already generates an executable that is efficient enough to use all the available execution units on the GPU device.

Your more sophisticated omp construct, "!$omp target teams distribute parallel do simd", does not give access to more computing power. The simd clause does give you some memory access/stride benefits in some circumstances, but that is not observable here.

Since you already know how to leverage the "LIBOMPTARGET_PLUGIN_PROFILE=T" output to analyze your code (which shows your great skill and proficiency), you may also try replacing "LIBOMPTARGET_PLUGIN_PROFILE=T" with "LIBOMPTARGET_DEBUG=1 LIBOMPTARGET_INFO=7". You can then examine the kernel execution policy in the output by searching for the keyword "Launching".

What is the compilation command you used with -fast that generated the error? Have you included "-qopenmp -fopenmp-targets=spir64" in the command? Also, have you initialised the oneAPI environment? The error message says the compiler is not prepared to link the OpenMP target runtime; usually this comes from the compiler not having the proper flags and/or not finding the proper library first.


Theo_M
Novice

Dear Shiquan,

I figured this out from Jim's answer above! I'm more used to the GFortran flavour of offloading, where, on older hardware, the `simd` directive is necessary.

I'll quickly point out that the Intel compiler was not smart enough to apply all the available power until I moved the contents of the original loop into a pure elemental subroutine.

With regard to the errors I was seeing with -fast, I used all the same flags as in my original post, as below:

ifx -xhost -qopenmp -fopenmp-targets=spir64 -fast ./app/main.f90

The oneAPI environment is initialised: if it wasn't, IFX wouldn't even be detected as a command! Replacing `-fast` with `-Ofast` does work and compiles.

Do you know what libraries it is looking for in $PATH and why they might not have been built as part of the standard installation process/ordeal?

Thanks,

Theo

Shiquan_Su
Moderator

Hi, Theo:

I agree with you. I am glad that you figured out that -Ofast works for your code; you may stay with -Ofast if that is what works for you. -Ofast is a slightly different version of -fast, but probably provides very similar optimization. -Ofast is provided for compatibility with GCC.

There is no simple answer to "what libraries it is looking for within $PATH and why they might not have been built as part of the standard installation process/ordeal".

You may study the following two web pages for the difference between the two flags.

https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/fast.html
https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/ofast.html

Theo_M
Novice

Thanks for the reply and links (I'd already found the difference in flags, but thanks anyway).

I'm unsure why `-static` (which I believe `-fast` implies) causes this - it may need `-qopenmp -fopenmp-targets=spir64_gen` rather than `-qopenmp -fopenmp-targets=spir64`, but I have not had a chance to check this yet.

Thanks,

Theo
