How to Debug / OpenMP related

Dix_Carneiro__Rafael · ‎11-14-2019

Hi all,

I am using the Intel Fortran compiler for Windows -- Parallel Studio XE 2019 Update 5 -- with Microsoft Visual Studio.

My code compiles successfully and runs smoothly in Debug mode. However, in Release mode it crashes with the following message (from Visual Studio):

Exception thrown at 0x00711AAE in TradeInformality_FineGrid.exe: 0xC0000005: Access violation reading location 0x00000000.

After some research, I found out it crashes here (and, more precisely, as soon as it enters the parallel do part):

    maxthr = omp_get_max_threads()

    ! Set the number of threads
    Call omp_set_num_threads(maxthr)

    !$omp parallel do private(j,k)
    do k = 1, nZ
        do j = 1, n_nodes
            BBT(j,k) = maxval( tmp(:,k) - C_subT(:,j) )
        end do
    end do
    !$omp end parallel do

If I remove the OpenMP directives or if I "Generate Sequential Code (/Qopenmp_stubs)", the code runs fine. So, I am unsure what may be wrong here. Any ideas on how to debug this?

Many thanks,
Rafael

jimdempseyatthecove · ‎11-14-2019

Can you show the declarations of BBT, tmp and C_subT?

Reading location 0x00000000 suggests that one of them may be an uninitialized pointer or an unallocated array.

Apparently you are running a 32-bit application.

In Release build, see what happens if you add the runtime check for array bounds checking. This will inhibit vectorization of the loop, but it should not affect the declarations of BBT, tmp and C_subT.

Also, in Release build, without the runtime check for array bounds checking, what happens with !DIR$ NOVECTOR placed in front of do j loop?

I seem to recall an old bug, which may have resurfaced, where one of the CPU registers used to reference the base of an array is erroneously zeroed. If you are adventurous, can you generate your Release build with Debug symbols (both compiler and linker options) and place a breakpoint on the maxval statement? Pause all threads except the current thread (Threads pane in the debugger), open the Registers and Disassembly windows, then single-step with focus in the Disassembly window. Before each step, check whether the base register is zero.

008012C2  movups      xmmword ptr [edx+eax*8+10h],xmm2

In the above, edx is the base register, eax is the index, 8 is the scale factor, and 10h is an offset.

Because the target address of the exception was 0x00000000, I would expect the two registers and offset to be 0.

Jim Dempsey

Dix_Carneiro__Rafael · ‎11-14-2019

Dear Jim,

Many thanks for your thoughtful response!

Yes, I double checked, and the variables you mention seem to be well declared and initialized.

real(KIND=DOUBLE), dimension(:,:), allocatable :: tmp(:,:), BBT(:,:), C_subT(:,:)

allocate(C_subT(nE,n_nodes))
allocate(BBT(n_nodes,nZ))
allocate(tmp(nE,nZ))
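
A minimal sketch of how the allocation status and shapes of these arrays could be verified explicitly before the parallel loop. It is a hypothetical standalone program, not the original code: the definition of DOUBLE and the sizes nE, n_nodes and nZ are assumptions made only so that the example compiles on its own.

```fortran
program check_arrays
    implicit none
    integer, parameter :: DOUBLE = kind(1.0d0)          ! assumed definition of DOUBLE
    integer, parameter :: nE = 4, n_nodes = 3, nZ = 2   ! hypothetical sizes
    real(kind=DOUBLE), allocatable :: tmp(:,:), BBT(:,:), C_subT(:,:)

    allocate(C_subT(nE,n_nodes))
    allocate(BBT(n_nodes,nZ))
    allocate(tmp(nE,nZ))

    ! All three arrays must be allocated before the loop touches them.
    if (.not. allocated(tmp))    error stop 'tmp is not allocated'
    if (.not. allocated(BBT))    error stop 'BBT is not allocated'
    if (.not. allocated(C_subT)) error stop 'C_subT is not allocated'

    ! The shapes must agree with how the loop indexes the arrays:
    ! BBT(j,k) = maxval( tmp(:,k) - C_subT(:,j) )
    if (size(tmp,1) /= size(C_subT,1)) error stop 'tmp and C_subT differ in dimension 1'
    if (size(BBT,1) /= size(C_subT,2)) error stop 'BBT dim 1 does not match C_subT dim 2'
    if (size(BBT,2) /= size(tmp,2))    error stop 'BBT dim 2 does not match tmp dim 2'

    print *, 'All arrays are allocated with consistent shapes.'
end program check_arrays
```

If any of the checks fails, the program stops with the corresponding message instead of reaching the loop.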

"In Release build, see what happens if you add the runtime check for array bounds checking." So, if I add the runtime check for array bounds checking, the code runs smoothly. No error!

"In Release build, without the runtime check for array bounds checking, what happens with !DIR$ NOVECTOR placed in front of do j loop?" I get the same error!

        !$omp parallel do private(j,k)
        do k = 1, nZ
            !DIR$ NOVECTOR
            do j = 1, n_nodes
                BBT(j,k) = maxval( tmp(:,k) - C_subT(:,j) )
            end do
        end do
        !$omp end parallel do

I have not fully understood the rest of your suggestions. How can I "generate my Release build with Debug symbols (both compiler and linker options)"?

Many thanks again,
Rafael

jimdempseyatthecove · ‎11-15-2019

In the VS IDE select the Release Build
then in the Solution Explorer pane Right-Click on the Project for the application, then choose Properties
Verify, and select if necessary, that the Configuration and Platform pull-downs are set for Release (or all) and the platform of choice.
Expand Configuration Properties
Expand Fortran
Select General
Click in the value field of the property Debug Information Format, pull-down and select Full, Click Apply button
Expand Linker
Click on Debugging
Click in value field of Generate Debug Info, pull-down and select Yes
Click Apply, OK
Rebuild

Note that different versions of the MS VS IDE may have different legends and/or Property tree organizations. IOW you may have to hunt a little to locate these properties.

Jim Dempsey

Dix_Carneiro__Rafael · ‎11-15-2019

Dear Jim,

Many thanks for the details.

I compiled the code in Release mode with the debug options you requested. I also added a breakpoint where you suggested. Here is the result:

0062F767 mov ecx,dword ptr [ebp+20h]

However, be aware that under the options above, the code runs smoothly. I am not able to replicate the error with these options.

Also, interestingly, if I add the "write" line below, the code also works smoothly. Do you think this is a bug in the compiler?

        !$omp parallel do private(j,k)
        do k = 1, nZ
            do j = 1, n_nodes
                write(*,*) 'k=', k, 'j=', j
                BBT(j,k) = maxval( tmp(:,k) - C_subT(:,j) )
            end do
        end do
        !$omp end parallel do

Many thanks,
Rafael

jimdempseyatthecove · ‎11-15-2019

Without the write statement, that loop in release mode would likely execute using vector instructions. With the write statement, the loop will execute using scalar instructions. IOW different code (exclusive of write).

I do think at this point it appears to be a bug in the compiler.

As a means to coax the compiler into generating different SIMD code, try:

!$omp parallel do private(j,k)
do k = 1, nZ
    !dir$ simd
    do j = 1, n_nodes
        BBT(j,k) = maxval( tmp(:,k) - C_subT(:,j) )
    end do
end do
!$omp end parallel do

While the simd compiler directive shouldn't be required in this case, see if it corrects the problem.

If that is unproductive, try

!dir$ simd vectorlengthfor(double)

You should submit a bug report, along with your workaround if successful.

Jim Dempsey

jimdempseyatthecove · ‎11-15-2019

*** Side note

maxval( tmp(:,k) - C_subT(:,j) )

will internally generate the equivalent of a DO loop, either scalar or vector.

Therefore, one other quick test is to try:

!$omp parallel do private(j,k)
do k = 1, nZ
    do j = 1, n_nodes
        !dir$ simd
        BBT(j,k) = maxval( tmp(:,k) - C_subT(:,j) )
    end do
end do
!$omp end parallel do
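
For reference, here is a rough sketch of the DO loop that the maxval expression corresponds to. This is an illustration only, not the code the compiler actually emits. The definition of DOUBLE, the array sizes, the random test data, and the names BBT_loop and best are assumptions introduced for the example.

```fortran
program maxval_as_loop
    implicit none
    integer, parameter :: DOUBLE = kind(1.0d0)          ! assumed definition of DOUBLE
    integer, parameter :: nE = 5, n_nodes = 3, nZ = 4   ! hypothetical sizes
    real(kind=DOUBLE), allocatable :: tmp(:,:), C_subT(:,:), BBT(:,:), BBT_loop(:,:)
    real(kind=DOUBLE) :: best
    integer :: i, j, k

    allocate(tmp(nE,nZ), C_subT(nE,n_nodes), BBT(n_nodes,nZ), BBT_loop(n_nodes,nZ))
    call random_number(tmp)      ! made-up test data
    call random_number(C_subT)

    do k = 1, nZ
        do j = 1, n_nodes
            ! Array-expression form used in the thread
            BBT(j,k) = maxval( tmp(:,k) - C_subT(:,j) )

            ! Explicit scalar loop computing the same reduction
            best = -huge(best)
            do i = 1, nE
                best = max(best, tmp(i,k) - C_subT(i,j))
            end do
            BBT_loop(j,k) = best
        end do
    end do

    ! Compare the two results with a small tolerance
    if (maxval(abs(BBT - BBT_loop)) > 1.0e-12_DOUBLE) error stop 'results differ'
    print *, 'maxval and the explicit loop agree'
end program maxval_as_loop
```

Writing the inner reduction out by hand like this is another way to change the code the compiler generates for the statement, in the same spirit as the directive tests above.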

Jim Dempsey

Dix_Carneiro__Rafael · ‎11-15-2019

Great, many thanks.

Before I submit a bug report, there is one more piece of information.

I usually turn on the /Qparallel option, together with /Qopenmp:

/nologo /O2 /Qparallel /heap-arrays0 /Qopenmp /module:"Release\\" /object:"Release\\" /Fd"Release\vc150.pdb" /libs:static /threads /Qmkl:sequential /c

Now, if I remove /Qparallel from the command line, I have no error and the code runs smoothly.

Is it wrong to compile with both /Qparallel and /Qopenmp?

jimdempseyatthecove · ‎11-15-2019

**** I usually turn on the /Qparallel option, together with /Qopenmp

NO - Bad idea

Use one or none

The compiler can generate parallel code from OpenMP directives or through implicit (auto) parallelization, but using both together is bad, error-prone practice.

Your loop (without the !dir$ simd), and both options, would have generated code to use OpenMP on the do k loop, and auto-generate parallel code on:
do j
or
maxval
or do j and maxval

In the process you would be generating nested thread pools.

Assume your system has 8 hardware threads. The OpenMP loop will generate a top-level OpenMP thread pool with 8 threads. Then each thread executing the parallel do j loop, when encountering the auto-parallel "region", will generate a non-OpenMP thread pool (even though it may do so by borrowing code from the OpenMP runtime system). Now your system will have 8 pools, each with 8 threads (64 threads). Should maxval with the array expression itself be auto-parallelized within the auto-parallel do j loop, then each thread of that nested level will generate a non-OpenMP thread pool with 8 threads: 8*8*8 threads (512 threads).
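
The arithmetic above can be reproduced for any machine with a small sketch like the one below. It only reports what the OpenMP runtime says about the top-level pool and multiplies that out; it does not measure how many threads a /Qparallel build actually creates. The program name and variable names are made up for the example.

```fortran
program thread_estimate
    !$ use omp_lib
    implicit none
    integer :: nthreads

    ! Lines starting with "!$ " are compiled only when OpenMP is enabled.
    ! Without OpenMP they are plain comments and nthreads stays at 1.
    nthreads = 1
    !$ nthreads = omp_get_max_threads()

    print *, 'threads in the top-level OpenMP pool :', nthreads
    print *, 'worst case with one nested pool level:', nthreads**2
    print *, 'worst case with two nested levels    :', nthreads**3
end program thread_estimate
```

If the runtime reports 8 threads, the three numbers match the 8, 64 and 512 in the estimate above.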

Jim Dempsey

Dix_Carneiro__Rafael · ‎11-15-2019

Thank you very much, Jim.

Is there a general preference for using /Qparallel vs /Qopenmp for parallelizing loops?

jimdempseyatthecove · ‎11-15-2019

In my opinion, auto-parallelism is only warranted in rather trivial programs that can benefit from parallelization and where the programmer (support person) is reluctant or prohibited from making any source code changes. By trivial I mean programs of low complexity that typically have loops with no nest levels. In more complex programs, typically those with nested loops, it is difficult for the compiler to determine where best to place the auto-parallel regions, in particular where nested usage is not clear to the compiler, or where an intrinsic function (maxval on an array expression) may not be aware that it is being executed within a parallel region.

Jim Dempsey