Intel® Fortran Compiler

problems with large arrays due to probable memory "leak"

grpllrne
Beginner

Windows 10 N LTSC; Intel Core i5-5300U; 8 GB DDR3.

Compiler version: Intel(R) Visual Fortran Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 19.1.0.166 Build 20191121

Compiler options: /nologo /O3 /QaxCORE-AVX2 /QxCORE-AVX2 /tune:broadwell /module:x64\Release\ /object:x64\Release\ /Fdx64\Release\vc160.pdb /libs:dll /threads /c /Qopenmp /Qlocation,link,C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.25.28610\bin\HostX64\x64 /Qm64

Hello!

I am new to Fortran. I'm writing a small discrete element method code. The main problem I'm facing is imprecision with large arrays.

For example, my code runs fine with 4 particles, but with large systems of around 100,000 particles something goes very wrong. With DT (delta time) = 1.0E-6 s, I get accelerations with magnitudes of 1.0E+7, while the maximum should be the acceleration of gravity.

MODULE DECLARE
implicit none
    integer, parameter :: sp = selected_real_kind(6, 37)
    integer, parameter :: dp = selected_real_kind(15, 307)
    character(len=90) :: system_name, print_format, solver, errormsg, dummy
    INTEGER :: number_of_particles, number_of_walls, ios, print_vtk, print_number, i, j, k, read_vtk, dimmy
    real(kind=dp):: final_time,time,elasticity_modulus,poisson_coefficient,density,overlap_wall,overlap,temporary_time, &
& effective_elasticity_modulus,effective_mass,effective_radius,norm_position,delta_time_to_print,delta_time
    real(kind=dp), allocatable, dimension(:) :: radius, mass
    real(kind=dp), allocatable, dimension(:,:) :: preceding_position, preceding_velocity, velocity, position, force, &
        & t0_position, t0_velocity
    real(kind=dp), allocatable, dimension(:,:) :: wall_point, wall_normal
    real(kind=dp), parameter :: pi = 3.14159265358979_dp
    real(kind=dp), dimension(3), parameter :: gravity = (/ 0.0_dp, -9.7803267714_dp, 0.0_dp /)
    real(kind=dp), DIMENSION(3) :: temp=0,normal_dir=0
END MODULE DECLARE

SUBROUTINE READ_INPUT

...

allocate(t0_position(number_of_particles,3), t0_velocity(number_of_particles,3), force(number_of_particles,3), velocity(number_of_particles,3), &
       & radius(number_of_particles), preceding_position(number_of_particles,3), preceding_velocity(number_of_particles,3), position(number_of_particles,3), &
       & mass(number_of_particles), STAT=ios, ERRMSG=errormsg)

IF(ios /= 0) THEN
    print *, "Error trying to allocate array", errormsg
    STOP
END IF

....

END SUBROUTINE READ_INPUT

SUBROUTINE PRINT_VTK_ASCII

...

END SUBROUTINE PRINT_VTK_ASCII

SUBROUTINE EULER
...
DO time=0.0_dp, final_time, delta_time
    force=0.0_dp
    position=0.0_dp
    velocity=0.0_dp
!$OMP PARALLEL DO
    DO i=1, number_of_particles
        DO k=1, number_of_walls

        ....calculations

        END DO

        DO j=1, number_of_particles

        ....calculations

        END DO

    END DO

!$OMP END PARALLEL DO

IF ... THEN

CALL PRINT_VTK_ASCII

END IF

preceding_position = position
preceding_velocity = velocity

END DO

deallocate(t0_position, t0_velocity, force, velocity, radius, preceding_position, preceding_velocity, position, mass, &
       & STAT=ios, ERRMSG=errormsg)

IF(ios /= 0) THEN
    print *, "Error trying to deallocate array", errormsg
    STOP
END IF

END SUBROUTINE EULER

PROGRAM TEST

CALL READ_INPUT

CALL EULER

END PROGRAM TEST

Basically, this is a summary of the program (to make the data ordering clearer). I have tried automatic arrays (I got a stack overflow), explicit-shape arrays (severe imprecision too), and allocatable arrays, also with a lack of precision.

I also cannot print in binary (I get a stack overflow), yet the same subroutine in a friend's code with more than 300,000 particles works perfectly; in my code it does not. In ASCII it prints fine. It is really strange. Honestly, I don't know where the error is, nor where to look for it.

Thank you for your attention

 

FortranFan
Honored Contributor II

Gropelli, Henrique wrote:

.. I also cannot print in binary (I get a stack overflow), yet the same subroutine in a friend's code with more than 300,000 particles works perfectly; in my code it does not. In ASCII it prints fine. It is really strange. Honestly, I don't know where the error is, nor where to look for it. ..

@Gropelli, Henrique,

Can you first disclose some basic info about your effort? E.g., is it part of an educational course (homework assignment/exam, final project, etc.), and if so, would you be honoring the terms of the instructor and/or the educational institution by seeking guidance/assistance on a peer-to-peer forum like this one?

Thanks,

jimdempseyatthecove
Honored Contributor III

With particle-particle interactions using multiple threads, you must avoid having the same particle updated at the same time by two different threads.

IOW, you cannot have one thread calculating and updating the interactions (e.g. force) between two particles A and B while a different thread is calculating and updating the interactions between two particles one of which is A or B. You will experience a race condition between the read-modify-write of one thread and the same operations by a different thread.

One common method is to partition the particles into tiles, and then have each thread work on combinations of tiles that do not include tile combinations used by any of the other threads. An example using 2 threads: partition the particles into 2*nThreads tiles (in this case 4):

Partitions 4
A B C D

thread 0 | 1

A A | C C
B B | D D
A B | C D
A C | B D
B C | A D

Each line is a different parallel region (or you can use a barrier between lines).

You may find it handy to construct an array of the pairings.

A different method, which is more computationally costly, is for each thread to update only one of the two particles (with each thread owning an exclusive partition of the particles). This requires the forces to be calculated twice; see the sketch just below.
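A minimal sketch of this doubled-work approach, assuming a hypothetical helper contact_force(i, j) that returns the force on particle i due to particle j (not code from this thread):

!$OMP PARALLEL DO PRIVATE(j, fcont)
DO i = 1, number_of_particles        ! each i is owned by exactly one thread
    DO j = 1, number_of_particles
        IF (j == i) CYCLE
        fcont = contact_force(i, j)  ! the pair is recomputed when the owner of j visits (j,i)
        force(i,:) = force(i,:) + fcont  ! only row i is written, so no race
    END DO
END DO
!$OMP END PARALLEL DO

The tiled approach below avoids that duplicated work: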

!$OMP PARALLEL
nThreads = omp_get_num_threads()
!$OMP END PARALLEL
nPartitions = nThreads * 2
nParticlesPerPartition = number_of_particles / nPartitions
if(mod(number_of_particles, nPartitions) > 0) nParticlesPerPartition = nParticlesPerPartition + 1
nPairsCombos = 2*nThreads + 1 ! pairs per thread; 5 in the 2-thread table above
allocate(pairing(1:nPairsCombos*2, 0:nThreads-1))
! populate the pairs with the starting indexes of the two partitions (may be the same)
...
DO iStep = 0, nSteps
  time = delta_time * iStep ! not susceptible to accumulation of round-off errors
  ...
  !$OMP PARALLEL private(iThread, iPart, ...) ! sans DO
  iThread = omp_get_thread_num()
  ! Note, this is inside the parallel region
  DO iPart = 1, nPairsCombos*2, 2
    iBeginA = pairing(iPart, iThread)
    iEndA = MIN(iBeginA+nParticlesPerPartition-1, number_of_particles)
    iBeginB = pairing(iPart+1, iThread)
    iEndB = MIN(iBeginB+nParticlesPerPartition-1, number_of_particles)
    if(iBeginA == iBeginB) then
       ... ! same partition
    else
       ... ! different partitions
    end if
    !$OMP BARRIER ! synchronize stepping through the pairs
  END DO ! still inside the parallel region
  !$OMP END PARALLEL
END DO ! iStep
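For the 2-thread table above, a minimal sketch of how the pairing array might be populated (the partition layout is an assumption, matching the table row by row):

! Assumed layout: partitions A,B,C,D start at 1, 1+n, 1+2n, 1+3n,
! where n = nParticlesPerPartition; rows follow the 2-thread table above.
iA = 1
iB = iA + nParticlesPerPartition
iC = iB + nParticlesPerPartition
iD = iC + nParticlesPerPartition
pairing(:, 0) = [iA,iA, iB,iB, iA,iB, iA,iC, iB,iC] ! thread 0: AA BB AB AC BC
pairing(:, 1) = [iC,iC, iD,iD, iC,iD, iB,iD, iA,iD] ! thread 1: CC DD CD BD AD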
  

Use the allocatable arrays.

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

It will be your responsibility to debug the sketch code above. You could calculate the pairing combinations on the fly or use the array method; I suggest using the array of precalculated indexes, as this reduces the number of recalculations.

Jim Dempsey

grpllrne
Beginner

jimdempseyatthecove (Blackbelt) wrote:

With particle-particle interactions using multiple threads, you must avoid having the same particle updated at the same time by two different threads. [...] Use the allocatable arrays.

Jim Dempsey

Even without parallelization I have the same issue. I get velocities on the scale of 1e+3 when I should be getting velocities with magnitudes of 1e-1 after 1e-5 s.

It isn't very clear to me why a thread would calculate the contact of a particle more than once. The way I arranged the DO loops seems right (without parallelization), and I thought the first particle loop (DO i=1, number_of_particles) would be divided among the threads of my CPU, so that, following the code, the order in which the particle contacts are calculated would not matter. Only the forces are computed in the parallel region.

Thank you

 

 

grpllrne
Beginner

FortranFan wrote:

Can you first disclose some basic info about your effort? E.g., is it part of an educational course, and if so, would you be honoring the terms of the instructor and/or the educational institution by seeking guidance here? [...]

Hello!

I'm about to graduate in mechanical engineering. This is out-of-class research intended to consolidate and apply my knowledge. My advisor is very busy due to the situation related to COVID-19 and the university, so, since I have more of a computational problem than a mechanical one, I decided to post here.

thank you

jimdempseyatthecove
Honored Contributor III

Not seeing your entire code, it is difficult to ascertain where your problem is. Here is a sketch of the general issue I am talking about:
 

do i=1,nParticles ! loop 1
  do j=i+1,nParticles ! loop 2
    ! calculate the interaction force once, apply it to both particles
    call calcForces(i,j) ! particle(i) += interaction; particle(j) -= interaction
  end do
end do
do i=1,nParticles ! loop 3
  call applyForces(i)
end do

Should loop 1 be parallelized (assume 2 threads):

0000000000000000000011111111111111111111 (slice of particle starting points, index i)
x000000000000000000000000000000000000000 (iteration of j for thread 0)
                    x1111111111111111111 (iteration of j for thread 1)

There is a potential for conflicting r/m/w access to the same particle, at the same time, by both threads.

It is not clear what your code is doing, in particular with your choice of the variable number_of_walls. If this is your means of tiling, then your undisclosed code may not have all particles interacting with each other, but rather all particles interacting with neighboring particles (or something like that). If the latter is true, then the shared boundaries between the walls processed by different threads can experience race conditions on read-modify-write operations.

Jim Dempsey
 

 

jimdempseyatthecove
Honored Contributor III

Can you supply a sample xxx.vtk file?

Jim Dempsey

JohnNichols
Valued Contributor III

Henrique:

1. You need to understand what this group of experts can and cannot do.
2. You need to understand what is reasonable to ask of this group.
3. You are a graduate student and are thus obliged to follow normal academic rules; I say this having chaired close to 120 graduate committees.

4. You need to give these experts your hypothesis for the problem. Every problem you solve has to have an underlying hypothesis: why are you studying the problem?

5. You need to provide your underlying algorithm; if the algorithm is bad, so is the code. You need to provide the entire code and a decent sample file. Of course, that places it in the public domain and then anyone can use it, so if it is commercially aimed then you are on the wrong forum. You need to supply the name of the paper or book where you obtained the idea.

Short of that, you are asking someone to shoot a can in the dark with their eyes closed. Statistically it is possible, but not probable.

Trust me, I have done this many times; you need to understand the limitations and address them if you want the problem solved. If you keep going the way you are going, you will not succeed except on your own.

John

 

jimdempseyatthecove
Honored Contributor III

>>I get velocities on the scale of 1e+3 when I should be getting velocities with magnitudes of 1e-1 after 1e-5 s.

Insert a test in your DO j loops to detect when fcont and/or the accumulating force gets out of range, and break on that condition. Something like:

if(norm2(force(i,:)) > force_bug .or. norm2(force(j,:)) > force_bug .or. norm2(fcont) > fcont_bug) then
    print *,"Bug" ! place a break point here
end if

Run the Debug break test in serial mode until you find why the serial mode is producing unreasonable results.

You may find that initial values are buggered up, or you have a coding error that produces weird results.

Minor optimization issue:

Change:

        DO i=1, number_of_particles
...
            DO j=1, number_of_particles
                IF (i>=j) CYCLE
...
            END DO

To:

        DO i=1, number_of_particles
...
            DO j=i+1, number_of_particles
...
            END DO

**** parallel race condition

                    force(i,:) = force(i,:) + fcont
                    force(j,:) = force(j,:) - fcont

While the i index of different threads will not overlap, the j index will overlap

Prior to tiling your algorithm, you can test the parallel code (with severe performance impact) by placing a critical section around those two statements.

!$omp critical
                    force(i,:) = force(i,:) + fcont
                    force(j,:) = force(j,:) - fcont
!$omp end critical

Proper tiling can eliminate the critical section.
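An aside not raised above: an OpenMP reduction clause is another race-free option. A minimal sketch, assuming the loop structure from the excerpts; note that reduction gives each thread a private copy of force, which is a real memory cost at ~100,000 particles:

!$omp parallel do private(j, fcont) reduction(+:force)
do i = 1, number_of_particles
    do j = i+1, number_of_particles
        ! fcont = ... contact force for the pair (i,j), as in the original loop
        force(i,:) = force(i,:) + fcont
        force(j,:) = force(j,:) - fcont
    end do
end do
!$omp end parallel do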

*** Get the simplified code working first, then optimize

Your EULER_EXPLICIT subroutine is indeed Euler Integration
Your EULER_IMPLICIT subroutine is better known as Adams Integration

Jim Dempsey

grpllrne
Beginner

Nichols, John wrote:

1. You need to understand what this group of experts can and cannot do. [...] If you keep going the way you are going, you will not succeed except on your own.

John

John,

My expectation in this group was to get help solving what is a coding/computational problem. As I said, the problem isn't in the algorithm, the mechanics, etc., but in how the arrays are being allocated; on top of that, I have spent more than 3 weeks debugging this problem. I already asked my professor for advice through the means that were available, without success. So, strictly following the rules of my university, I thought that being concise and showing parts of the code would be easier and more reasonable. I don't have any commercial interest; I just tried to make the process of getting a solution easier (and faster). The discrete element method is as common as finite elements; there are many open-source simulation programs. This code is a simple application for learning more Fortran and applying what I have learnt through these years, without any ethical implication.

grpllrne
Beginner

jimdempseyatthecove (Blackbelt) wrote:

Insert a test in your DO j loops to detect when fcont and/or the accumulating force gets out of range [...] Get the simplified code working first, then optimize.

Jim Dempsey

Yes, you're right. It is better to correct the code before optimizing. The explanations about parallelization helped a lot.

I did the debugging again and I cannot locate why the calculation of the forces goes wrong. The overlap numbers are OK, but the forces go beyond 600 N, while the expected maximum is 20-30 N.

In contrast to this case, I have run the validation tests again with a few particles, with both subroutines, and got adequate results.

thank you

note: please rename "input31.vtk_.txt" to "input31.vtk"

 

mecej4
Honored Contributor III

Yes, there are probably errors in the physical model, as well as inconsistencies between the model and the code that is supposed to implement the calculations. In the presence of those errors, you are imposing rather absurd loads on the code; it is as if one were learning to skate with two sacks of sand on the shoulders. Here is how you can go about fixing the code.

1. Remove non-standard extensions unless you have a good reason to use them. For example, on line 262 you apply a relational operator to a LOGICAL variable:

          IF (gravity_is_on == .TRUE.

The standard comparison for LOGICALs is .EQV.; simpler still is IF (gravity_is_on) THEN.

2. Turn off the absurdly excessive output. Your program writes a 12.6 MB file every time step, and you have asked for 10,000 time steps. That means that your program, if allowed to run to completion, would write 126 GB of output.  In the presence of errors in the model or programming, that's 126 GB of garbage.

3. Develop some idea of what the program should be doing. Instrument the program to catch divergences between expectations and actual behavior, and terminate the execution of the program as soon as the output becomes meaningless.

4. Hold off on parallelization until the program is generating reasonable output.

Here is what I found, by adding a couple of lines to monitor the maximum velocity. For each time, I print out the number of the particle with the largest magnitude of the velocity, and the value of that magnitude.
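A minimal sketch of such a monitor (variable names assumed from the excerpts; not necessarily the exact lines used):

vmax = 0.0_dp
imax = 0
do i = 1, number_of_particles
    if (norm2(velocity(i,:)) > vmax) then
        vmax = norm2(velocity(i,:))  ! track the largest velocity magnitude
        imax = i                     ! and which particle has it
    end if
end do
print '(A,F9.4,I7,ES12.4)', ' Time = ', time, imax, vmax

The run produced: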

 enter input filename without .vtk
input31
 Solver :
 explicit

 Number of particles =        98688
 Final_time, delta_time, delta_time_to_prin =    5.00000000000000
  5.000000000000000E-004  5.000000000000000E-004
 Time =     0.0000  15485  4.5561E+01
 Time =     0.0005   5054  2.8854E+02
 Time =     0.0010   7665  5.7819E+03
 Time =     0.0015  30213  9.7662E+03
 Time =     0.0020  30213  9.7662E+03
 Time =     0.0025  30213  9.7662E+03
 Time =     0.0030  30213  9.7662E+03
 Time =     0.0035  30213  9.7662E+03
 Time =     0.0040  30213  9.7661E+03
 Time =     0.0045  30213  9.7661E+03
 Time =     0.0050  30213  9.7661E+03
 Time =     0.0055  30213  9.7661E+03
 Time =     0.0060  30213  9.7661E+03
forrtl: error (200): program aborting due to control-C event

As you can see, particle 30213 reached a terminal velocity of 9.7661E+03 after just a few time steps. If you let the calculation proceed further, the outcome is very simple: that particle will travel out of the galaxy, or collide with something along its path that is not represented in your model.

There are various variables with "wall" in their name. Are those walls asleep on the job?

jimdempseyatthecove
Honored Contributor III

I haven't completely diagnosed the problem; mecej4 gives good advice. Additional comments:

            norm_position = ABS(NORM2((preceding_position(i,:)-preceding_position(j,:))))
            overlap = radius(i) + radius(j) - norm_position
            IF (overlap>0.0_dp) THEN
                normal_dir = (preceding_position(i,:)-preceding_position(j,:))/norm_position

1) NORM2 already returns a non-negative length; there is no need to ABS the square root of the sum of the squares of the particle separation.
2) You would have a fundamental problem should the separation of the two particles approach or equal 0.0, as the result of the /norm_position division goes bad. You will have to decide what you want to do (a guard sketch follows below). Note, this often happens when the integration step size is too large and permits the velocity of the particles to intrude too far into the radii before being repulsed. Consider using a variable integration step size where the size is proportional to the minimum separation (but >0.0).
3) The Release build may be applying /fp:fast (the default) for floating-point operations. I suggest you add /fp:precise to produce the full precision of SQRT (and other intrinsic functions). Only after your data looks correct should you experiment with the less precise (but faster) settings.
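A minimal sketch of the guard suggested in point 2 (the threshold value is an assumption to be tuned for the model):

real(kind=dp), parameter :: sep_min = 1.0e-12_dp ! assumed threshold
...
if (norm_position > sep_min) then
    normal_dir = (preceding_position(i,:) - preceding_position(j,:)) / norm_position
else
    normal_dir = 0.0_dp ! coincident particles: decide how the model should handle this case
end if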

Jim Dempsey

JohnNichols
Valued Contributor III

Gropelli, Henrique wrote:

My expectation in this group was to get help solving what is a coding/computational problem. [...]

 

My point is that when you are new to a group such as this one, which is filled with old-timers who learnt Fortran from punch cards, you should say that up front. If you look at FortranFan's questions you can see the issue; we are not here to solve students' homework, so you should say: this is not homework, I am not graded on this problem.

I would still give the background to the problem so these people understand your goals.  People help because they feel involved.

 

JohnNichols
Valued Contributor III

PS: The algorithm is always important. 

There is a great story about Srinivasa Ramanujan FRS. He was in a class, and the Mathematics Professor at Trinity College put up a problem on the board and said something to the effect of "I have been working on this problem for 20 years". Ramanujan solved the problem on the board. Professor Hardy is reputed to have torn into Ramanujan about what he did. There are ways of saying things. This may be an urban legend, I was not there, but Ramanujan solved many unsolvable problems of the day.

John
