I am using the Intel Fortran compiler for Windows -- Parallel Studio XE 2019 Update 5 -- with Microsoft Visual Studio.
My code compiles and runs without problems in Debug mode. In Release mode, however, it crashes with this message (from Visual Studio):
Exception thrown at 0x00711AAE in TradeInformality_FineGrid.exe: 0xC0000005: Access violation reading location 0x00000000.
After some research, I found out it crashes here (and, more precisely, as soon as it enters the parallel do part):
    maxthr = omp_get_max_threads()

    ! Set the number of threads
    call omp_set_num_threads(maxthr)

    !$omp parallel do private(j,k)
    do k = 1, nZ
        do j = 1, n_nodes
            BBT(j,k) = maxval( tmp(:,k) - C_subT(:,j) )
        end do
    end do
    !$omp end parallel do
If I remove the OpenMP directives or if I "Generate Sequential Code (/Qopenmp_stubs)", the code runs fine. So, I am unsure what may be wrong here. Any ideas on how to debug this?
Can you show the declarations of BBT, tmp and C_subT?
Reading location 0x00000000 would indicate one of them is (may be) an uninitialized pointer or unallocated array.
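A quick way to rule out the unallocated-array case is to test the arrays just before the parallel loop. The following is only a minimal sketch: it assumes the three arrays are 2-D allocatables, that DOUBLE is a double-precision kind parameter, and it uses made-up sizes (nE, n_nodes, nZ here are placeholders for illustration).

```fortran
program check_arrays
  implicit none
  ! Assumption: DOUBLE is a double-precision kind parameter.
  integer, parameter :: DOUBLE = kind(1.0d0)
  ! Hypothetical sizes, for illustration only.
  integer, parameter :: nE = 5, n_nodes = 4, nZ = 3
  real(kind=DOUBLE), allocatable :: tmp(:,:), BBT(:,:), C_subT(:,:)

  allocate(C_subT(nE, n_nodes), BBT(n_nodes, nZ), tmp(nE, nZ))

  ! Stop with a message if any array is not allocated.
  if (.not. allocated(tmp))    error stop 'tmp is not allocated'
  if (.not. allocated(BBT))    error stop 'BBT is not allocated'
  if (.not. allocated(C_subT)) error stop 'C_subT is not allocated'

  ! Stop if the extents do not match what the loop nest relies on:
  ! tmp(:,k) - C_subT(:,j) needs equal first extents, and
  ! BBT(j,k) is indexed by (column of C_subT, column of tmp).
  if (size(tmp, 1) /= size(C_subT, 1)) error stop 'tmp/C_subT first extents differ'
  if (size(BBT, 1) /= size(C_subT, 2)) error stop 'BBT rows do not match C_subT columns'
  if (size(BBT, 2) /= size(tmp, 2))    error stop 'BBT columns do not match tmp columns'

  print *, 'all arrays are allocated and their shapes conform'
end program check_arrays
```

In the real code, the same `allocated` and `size` tests would go immediately before the `!$omp parallel do` line.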
Apparently you are running a 32-bit application.
In Release build, see what happens if you add the runtime check for array bounds checking. This will inhibit vectorization of the loop, but it should not affect the declarations of BBT, tmp and C_subT.
Also, in Release build, without the runtime check for array bounds checking, what happens with !DIR$ NOVECTOR placed in front of do j loop?
I seem to recall an old bug, which may have resurfaced, where one of the CPU registers used to reference the base of an array is erroneously zeroed. If you are adventuresome, can you generate your Release build with Debug symbols (both compiler and linker options) and place a breakpoint on the maxval statement? When it is hit, pause all threads except the current thread (Threads pane in the debugger), open the Registers and Disassembly windows, then single-step with focus in the Disassembly window. Before each step, see if the base register is zero. For example:
008012C2 movups xmmword ptr [edx+eax*8+10h],xmm2
In the above, edx is the base register, eax is the index, 8 is the scale factor, and 10h is the offset.
Because the target address of the exception was 0x00000000, I would expect the two registers and offset to be 0.
Many thanks for your thoughtful response!
Yes, I double-checked, and the variables you mention appear to be correctly declared and initialized:
    real(KIND=DOUBLE), dimension(:,:), allocatable :: tmp(:,:), BBT(:,:), C_subT(:,:)

    allocate(C_subT(nE,n_nodes))
    allocate(BBT(n_nodes,nZ))
    allocate(tmp(nE,nZ))
"In Release build, see what happens if you add the runtime check for array bounds checking." So, if I add the runtime check for array bounds checking, the code runs smoothly. No error!
"In Release build, without the runtime check for array bounds checking, what happens with !DIR$ NOVECTOR placed in front of do j loop?" I get the same error!
    !$omp parallel do private(j,k)
    do k = 1, nZ
        !DIR$ NOVECTOR
        do j = 1, n_nodes
            BBT(j,k) = maxval( tmp(:,k) - C_subT(:,j) )
        end do
    end do
    !$omp end parallel do
I have not fully understood the rest of your suggestions. How can I "generate my Release build with Debug symbols (both compiler and linker options) "?
Many thanks again,
In the VS IDE, select the Release build.
Then, in the Solution Explorer pane, right-click the project for the application and choose Properties.
Verify that the Configuration and Platform pull-downs are set to Release (or All) and the platform of choice; change them if necessary.
Expand Configuration Properties.
Under Fortran > General, click in the value field of the Debug Information Format property, pull down and select Full, then click the Apply button.
Under Linker, click on Debugging.
Click in the value field of Generate Debug Info, pull down and select Yes.
Click Apply, then OK.
Note: different versions of the MS VS IDE may have different legends and/or property tree organizations. IOW you may have to hunt a little to locate these properties.
Many thanks for the details.
I compiled the code in Release mode with the debug options you requested. I also added a breakpoint where you suggested. Here is the result:
0062F767 mov ecx,dword ptr [ebp+20h]
However, be aware that under the options above, the code runs smoothly. I am not able to replicate the error with these options.
Also, interestingly, if I add the "write" line below, the code also works smoothly. Do you think this is a bug in the compiler?
    !$omp parallel do private(j,k)
    do k = 1, nZ
        do j = 1, n_nodes
            write(*,*) 'k=', k, 'j=', j
            BBT(j,k) = maxval( tmp(:,k) - C_subT(:,j) )
        end do
    end do
    !$omp end parallel do
Without the write statement, that loop in release mode would likely execute using vector instructions. With the write statement, the loop will execute using scalar instructions. IOW different code (exclusive of write).
I do think at this point it appears to be a bug in the compiler.
As a means to coax the compiler into generating different SIMD code, try:
    !$omp parallel do private(j,k)
    do k = 1, nZ
        !dir$ simd
        do j = 1, n_nodes
            BBT(j,k) = maxval( tmp(:,k) - C_subT(:,j) )
        end do
    end do
    !$omp end parallel do
While the simd compiler directive shouldn't be required in this case, see if it corrects the problem.
If that is unproductive, try
!dir$ simd vectorlengthfor(double)
You should submit a bug report, along with your workaround if it is successful.
*** Side note
maxval( tmp(:,k) - C_subT(:,j) )
will internally generate the equivalent of a DO loop, either scalar or vector.
Therefore, one other quick test is to try:
    !$omp parallel do private(j,k)
    do k = 1, nZ
        do j = 1, n_nodes
            !dir$ simd
            BBT(j,k) = maxval( tmp(:,k) - C_subT(:,j) )
        end do
    end do
    !$omp end parallel do
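To make the side note concrete, here is a rough sketch of the DO loop that the maxval call is equivalent to, checked against maxval itself. It is an illustration only, not the code the compiler actually emits. The sizes, the kind name dp, and the helper variables best and BBT_ref are assumptions introduced for this example.

```fortran
program maxval_as_loop
  implicit none
  integer, parameter :: dp = kind(1.0d0)              ! assumed double-precision kind
  integer, parameter :: nE = 6, n_nodes = 4, nZ = 3   ! hypothetical sizes
  real(dp) :: tmp(nE, nZ), C_subT(nE, n_nodes)
  real(dp) :: BBT(n_nodes, nZ), BBT_ref(n_nodes, nZ)
  real(dp) :: best
  integer  :: i, j, k

  call random_number(tmp)
  call random_number(C_subT)

  do k = 1, nZ
    do j = 1, n_nodes
      ! Reference result from the intrinsic.
      BBT_ref(j, k) = maxval(tmp(:, k) - C_subT(:, j))

      ! Hand-written equivalent: a running maximum over the first index.
      best = -huge(best)
      do i = 1, nE
        best = max(best, tmp(i, k) - C_subT(i, j))
      end do
      BBT(j, k) = best
    end do
  end do

  ! Stop with an error if the two versions disagree beyond rounding noise.
  if (maxval(abs(BBT - BBT_ref)) > 1.0e-12_dp) error stop 'explicit loop differs from maxval'
  print *, 'explicit loop matches maxval'
end program maxval_as_loop
```

Writing the inner loop out by hand is also a possible workaround to try, since it gives the compiler an ordinary loop to vectorize instead of an array expression inside an intrinsic.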
Great, many thanks.
Before I submit a bug report, there is one more piece of information.
I usually turn on the /Qparallel option, together with /Qopenmp:
/nologo /O2 /Qparallel /heap-arrays0 /Qopenmp /module:"Release\\" /object:"Release\\" /Fd"Release\vc150.pdb" /libs:static /threads /Qmkl:sequential /c
Now, if I remove /Qparallel from the command line, I have no error and the code runs smoothly.
Is it wrong to compile with both /Qparallel and /Qopenmp?
**** I usually turn on the /Qparallel option, together with /Qopenmp
NO - Bad idea
Use one or none
The compiler can generate OpenMP (directive-based) parallelization or implicit (auto) parallelization, but using both together is bad, error-prone practice.
Your loop (without the !dir$ simd), compiled with both options, would have generated code that uses OpenMP on the do k loop and auto-generated parallel code on do j, or on do j and maxval. In the process you would be generating nested thread pools.
Assume your system has 8 hardware threads. The OpenMP loop will create a top-level OpenMP thread pool with 8 threads. Then each of those threads, on encountering the auto-parallel "region" for the do j loop, will create a non-OpenMP thread pool (even though it may do so by borrowing code from the OpenMP runtime system). Now your system will have 8 pools, each with 8 threads (64 threads). Should maxval with the array expression itself be auto-parallelized within the auto-parallel do j loop, then each thread of that nested level will create another non-OpenMP thread pool with 8 threads: 8*8*8 threads (512 threads).
In my opinion, auto-parallelism is only warranted in rather trivial programs that can benefit from parallelization and where the programmer (or support person) is reluctant or prohibited from making any source code changes. By trivial I mean programs of low complexity, typically with loops that are not nested. In more complex programs, typically those with nested loops, it is difficult for the compiler to determine where best to place the auto-parallel regions, in particular where nested usage is not clear to the compiler, or where intrinsic functions (maxval on an array expression) may not be aware that they are being executed within a parallel region.
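As a sketch of the "use one" advice, the loop from this thread can be kept with explicit OpenMP only, building with /Qopenmp and without /Qparallel. The version below additionally uses default(none) so that every variable's sharing is spelled out; that clause, the sizes, the kind name dp, and the serial comparison array BBT_serial are additions for illustration and not part of the original code.

```fortran
program openmp_only
  implicit none
  integer, parameter :: dp = kind(1.0d0)   ! assumed double-precision kind
  integer :: nE, n_nodes, nZ, j, k
  real(dp), allocatable :: tmp(:,:), C_subT(:,:), BBT(:,:), BBT_serial(:,:)

  ! Hypothetical sizes, for illustration only.
  nE = 50
  n_nodes = 40
  nZ = 30
  allocate(tmp(nE, nZ), C_subT(nE, n_nodes))
  allocate(BBT(n_nodes, nZ), BBT_serial(n_nodes, nZ))
  call random_number(tmp)
  call random_number(C_subT)

  ! Explicit OpenMP on the outer loop only; each k writes its own column of BBT.
  !$omp parallel do default(none) shared(BBT, tmp, C_subT, n_nodes, nZ) private(j, k)
  do k = 1, nZ
    do j = 1, n_nodes
      BBT(j, k) = maxval(tmp(:, k) - C_subT(:, j))
    end do
  end do
  !$omp end parallel do

  ! Serial reference computation for comparison.
  do k = 1, nZ
    do j = 1, n_nodes
      BBT_serial(j, k) = maxval(tmp(:, k) - C_subT(:, j))
    end do
  end do

  ! Stop with an error if the parallel and serial results disagree.
  if (maxval(abs(BBT - BBT_serial)) > 1.0e-12_dp) error stop 'parallel and serial results differ'
  print *, 'parallel and serial results agree'
end program openmp_only
```

Because the directives are comments to a compiler that is not in OpenMP mode, the same source also builds and runs serially, which makes it easy to compare the two builds.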