I am developing a Finite Element Software for mechanical Engineering using iterative solvers and nonlinear mechanics using Fortran in OMP mode.
The serial code works perfectly.
However, the OMP version fails randomly within the iterative solver computing NaNs after thousands of solver iterations and tens of nonlinear iterations. At failure, it is always at the 1st solver iteration that this happens. If it does not happen at the first iteration, it continues to complete the solver solution for that nonlinear iteration successfully.
If I was to increase the STACKSIZE to approximately 100M, the software works, if nothing else runs on my server (AMD EPYC 7702 64 core/128 thread processor, dual socket), otherwise it fails again randomly.
In addition, the more OMP parallel do loops I add to the software, the more frequently and randomly NaN failures occur, always at the same location and always at the 1st solver iteration.
All vectors and arrays are ALLOCATABLE, but is it possible that there is an issue with either the AMD processor or the libiomp5md.dll I use?
Is there any quick advice to look at?
On your !$omp parallel ....
add clause default(none)
then explicitly add shared(xxx,yyy) and or private(zzz) as the cases may be.
This sounds like you have a shared variable that is required to be private
you have private, when you require initial value carried in and should have used firstprivate.
Note, on older versions of IVF you needed to pass in an unallocated array using firstprivate such as to obtain a copy of the unallocated array descriptor.
You have a race condition, say in a reduction.
calling a subroutine/function that was developed using single thread model and used SAVE to maintain advancing state variables between calls
other non-thread save coding.
NaNs generally occur with divide by zero. And this is an indication that something was not what you thought (required) it to be.
Thank you for your response.
there is an ATOMIC UPDATE statement, please see below, which I do not know if it causes any racing. The U vector is the one that most likely causing the NaN problem. It is a SHARED property and so is NDOF. The u_ele REAL vector and the g INTEGER vector are PRIVATE.
Do j = 1, ndof
!$Omp Atomic Update
u(g(j)) = u(g(j)) + u_ele(j)
However, f there is any of the issues you are referring to, will you expect the software to work as I increase the STACKSIZE=100M?
Is it possible that we can we have a one to one quick session to show you exactly what I have done, as you may be able to pick it up immediately?
Is the DO J loop an OMP parallel loop .OR. is this inside a larger parallel region?
How large is ndof? (iow is it large enough for parallelization?)
If ndof is large enough .AND. if the DO J loop is a OMP parallel loop, consider a filter approach.
nElementsPerCacheLine = 64 / sizeof(u(1)) !$omp parallel private(jj, iThread)! lack of DO iThread = omp_get_thread_num() nThreads = omp_get_num_threads() ! inside parallel region ! All threads issue this do DO j=1,ndof jj = g(j) if(mod((jj-1)/nElementsPerCacheLine, nThreads) == iThread) then u(jj) = u(jj) + u_ele(j) ! jj is one of ours ! it helps if u and u_ele are cache line aligned endif end do !$end parallel
I do understand that I sent insufficient information, but can only publish parts of the SW.
NDOF = 24 and the do loop is inside a major major parallel region (no of finite elements=250M), which is a process that calculates the u_ele(24) for each finite element cell.
Then, the NDOF loop assembles the global U, which has a dimension equal to 900M using the per element computed U_ELE, which has a dimension of 24.
Hence, to avoid thread racing into this assembly, I utilised the ATOMIC UPDATE option.
Does this change your approach shown in your post?
A NaN is caused by a floating point instruction that is incapable of producing a result. Instructions such as
A = B / 0.0
A = B + NaN
A = sqrt(-3.0) ! I am assuming this returns a NaN, else it faults
Note, +,- and *, excepting when one or both of the arguments are NaN, will not return a NaN. They may return INF or denormal value, or 0.0 but not NaN.
Your ATOMIC statement is performing +, therefore the NaN is generated elsewhere (provided that the initial value for the summation was not a NaN).
Note ATOMIC statements have high(er) latencies, and if possible, should be avoided. If you are not interested in performance, then use the ATOMIC. However, if performance is of issue, then use a few more neurons to construct a different means to perform your reductions.
>>which is a process that calculates the u_ele(24) for each finite element cell.
Not seeing the code, it is difficult to provide a correct response. This said, IIF the above statement is all there is to say about what is happening, then at the point of that DO j=1,ndof, that loop is updating the 24 DOF's of a single element (vertex). It is assumed (presumed) that the context of this do loop is within a loop that processes each element (vertex) where the collection is partitioned via a parallel loop (or other collection slicing method). IIF these conditions are true, then no collection of 24 NDOF's will be processed by more than one thread, and thus the ATOMIC statement can be removed.
My guess is, you were observing NAN's, and you added the ATOMIC directive as a guess as to what was happening.