Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.
28608 Discussions

Parallel programming with OMP and computing pi


I've asked a question some time ago about the co-array tutorial (

I've followed the tutorial to learn the basics of co-array, and now I'm trying to do an equivalent program using OpenMP instead of co-arrays to learn the basics of OMP. 


But I've noticed something strange going on. I don't know what is wrong.


When I use the sequential version, the program takes 22~28 seconds to run. But when I run the OMP version, it takes more than 30 seconds, in general. I don't understand why it takes longer... the co-array version is taking 12 seconds to run with two images (my processor has only two cores).

My code is below. Can anyone help me? Am I doing something wrong?


I'm compiling using only the qopenmp option with ifort.


program mcpi_using_openmp_lib
use omp_lib
implicit none
real*8 , parameter :: actual_pi = 3.141592653589793238d0
integer*8 i , num_trials/600000000/ , total/0/ , clock_start,clock_end,clock_rate
real*8 x , y , computed_pi

call omp_set_num_threads(2)

print '(A,I0,A)', "Computing pi using ",num_trials," trials sequentially"

call SYSTEM_CLOCK(clock_start)
call random_seed()

!$OMP parallel private(x,y) 
!$OMP DO reduction(+:total)
do i=1,num_trials
    call random_number(x)
    call random_number(y)
    if ( (x*x)+(y*y) <= 1.d0 ) total = total + 1
!$OMP end parallel

computed_pi = 4.d0*real(total,8)/real(num_trials,8)
print '(A,G0.8,A,G0.3)', "Computed value of pi is ", computed_pi, &
    ", Relative Error: ",ABS((computed_pi-actual_pi)/actual_pi)
call SYSTEM_CLOCK(clock_end,clock_rate)
print '(A,G0.3,A)', "Elapsed time is ", &
    REAL(clock_end-clock_start)/REAL(clock_rate)," seconds"



Labels (2)
0 Kudos
10 Replies
Valued Contributor III

Screenshot 2023-01-24 103536.png

I have to turn off the num_threads to get it to compile, the linker throws an error.  This is on a dell precision that is working hard on two accelerometers at the same time.  


0 Kudos

@JohnNichols , did you compile with /Qopenmp? You need that for the !$OMP directives to be recognized, otherwise they are just comments. Also, need that compiler option to pull in the library that contains the routine, SET_OMP_NUM_THREADS.


0 Kudos
Valued Contributor III

Screenshot 2023-01-25 095149.png

I created a new Fortran program in VS 2022, added the code and followed what I thought was the manual instruction to add the -qopenmp.  

I then placed the picture and knew someone would advise me of my mistake.  

I tend to agree with Andrew the problem is to small to be useful.  I was using IFORT and WIN 32.  

A two core machine is going to be doing other things, so I do not expect much change.  

I have a program that spins up 4 threads and I time each thread, the 2nd and the 4th are identical except for the data, long term, say a day the 2nd returns in 1.5 seconds and the fourth in 10 seconds.  As long as it is less than 12 seconds it is in front of the other two threads and God smiles at me again.   

Here is the full solution. 

It was just a bit of fun whilst I try and work out how to synchronize two accelerometers to yield the same frequency set for a given vibration.  


0 Kudos
Valued Contributor III

Screenshot 2023-01-25 100238.png

This is 2 accelerometers at the same location on the same computer on two identical programs.  The linear regression is the difference in speed - 1.03 factor.  So the timer in the accelerometer is not returning real time, it is returning consistent time, but not in seconds, one uses the ticks on the computer to calibrate this data and then work out the dodgy factor to adjust the ratio of the FFT steps to real frequency and then compare the original results for frequency response.  Of course you cannot get two accelerometers in one spot and I have not as yet bought a shaker.  One day.  But my desk shakes at 42 Hz.  

The 1.08 is the difference once I add the code to determine the rate of the timing counter.  

0 Kudos
Honored Contributor III

>> I try and work out how to synchronize two accelerometers to yield the same frequency set for a given vibration. 

You can't synchronize the two accelerometers. What you can do is to calibrate each such that you can obtain the correction factor for each accelerometer. IOW each accelerometer has an internal clock. The clock frequency of each accelerometer my (usually does) differ from each other as well as your system clock and the real-world clock. These frequencies may also drift over time or temperature, but we will ignore this for now.

Be aware that any system/device call to the device, say "get time now", will have a variable latency:

a) from the call issued in your application to the reception inside the accelerometer.


b) from the send back in the accelerometer to reception inside the application.

Therefore, your calibration routine must (attempt to) account for this (or at least minimize the variability).

OpenMP threads will have different start times as you enter a parallel region, as well as, potentially be preempted and/or delayed in startup due to other activities on your system.

Therfore, when trying to synchronize activities on multiple threads, do not use a fixed delay in your loop. Instead, obtain the current time (ticks) and then determine the delta time to next interval.

Note, the next interval is:

     StartTime + intervalTime * intervalNumber


     NextTime = NextTime + intervalTime    ! this accumulates roundoff errors


Jim Dempsey


0 Kudos
Valued Contributor III


Sorry I used a poor word choice and you are absolutely correct. 

The program loops and reads the buffer from the accelerometer, in hexadecimal, one has to search through the hexadecimal for the ALPHA characters ST.  Once found you have a data set that is 14 blocks long of acceleration data.  You play with that. 

Theoretically you can ask the accelerometer to send you "the internal timing counter" this is not fixed to real time in any strict sense, certainly not close enough to pass a test on this in Physics 1 at ANU.  They were hard taskmasters.  It is supposed to come back at a rate of 30 to 1 for the acceleration data, it does not, so I have to read the timing register directly, so as not to slow up the process to much I only do that about every 1000 data points.  

Plus there is a horrendous latency time, you really do not know when the readings were taken - the buffer can be 1 to 1200 long.  

So I use the ticks on the clock, each time I write a single record to the CSV file I put the ticks with it.  I also store the ticks in an array of 835584 length - 16384 records in 51 sets.  

I can then do a linear regression on the stored ticks data and get an approximation of the rate of progress on the data stream.  

I have ten accelerometers, to save on NUC costs, this is the expensive part, I want two accelerometers running on the one NUC, as the PI wants to know the difference in stiffness for wood that is 260 years old that has no nails and nail - iron intrusion. 

The slope on the linear regression of the tick data can be used to convert the FFT steps to Frequency.  I can then compare the results for the two accelerometers at a single location, ie on my desk side by side.  

Screenshot 2023-01-25 114814.png

On the output screen I can add the current estimated frequency.  I average the frequency over 4 - 16384 time periods so it does not pop around to much. 

If I plot the output rate for the overall 51 set files, about 8000 per day, I can get the rate difference between the two accelerometers and compare it to the estimate of frequencies and I can compare the estimated frequencies from the results.  

The real issue is the serial read on the accelerometer, it is straight out of the 1980's.

This is the graph of three days monitoring the output for the two instruments, the third on at 1.015  times, reflects the addition of the code to monitor the frequency.  

Screenshot 2023-01-26 121115.png

I do track the timing data from the accelerometer and compare it to the ticks to see if the number is remaining constant and it appears to as long as you take long enough time.

ST.COM manuals tell you all the electronics, but the physics is something else.  

Theoretically the accelerometer is told to run at say 1666 Hz, but you will only get that if you do nothing else, as soon as you have code the buffer read rate slows up.  As long as the read rate is above 1300 I am happy that gives 600 Hz and most structures are quiet above about 400 Hz. 


0 Kudos
Valued Contributor III

This also allows for the drift as I store the current read frequency with the data output.  Yes I measured the drift.  

I am just starting to look at a range finder, to determine the movement of the accelerometer relative to ground so I can solve the integration problem,  over a 2 F temperature difference the drift is statistically significant over night for the range finder.  

It is just a lot of fun, looking for a slow drift in the stiffness of the beams.  

0 Kudos
Valued Contributor III

One runs the read in a continuous thread and adds a second thread to do the FFT every 10 seconds or so on.  Even a Raspberry PI can handle this load.  

0 Kudos
Honored Contributor III

My guess is that the amount of "work" being done in each loop is so small  that the time saving of parallel processing is negated by the overhead (additional work) of managing the different threads.  

0 Kudos

You are doing. a lot of things I would not do.


You start and end the timer around a lot of code.  PRINT is pretty slow.  

Why didn't you just put the start and stop around the loop?  Isn't that what you want?


Next, you have 2 calls to RANDOM_NUMBER inside the parallel region.

RANDOM_NUMBER - It is "thread safe".  But is it "THREAD HOT" - that is, does it allow multiple threads to call it simultaneously OR does it serialize the calls?  I would have to look but I would guess it serializes, that is, when 1 thread is inside it it blocks all other thread calls until that call is complete.  I am sure it serializes, based on the algorithm it uses.

MKL I believe has a fast parallel random number generators - actually several that use different algorithms.

But maybe easier - load up 2 large rank 1 arrays with random numbers.  Outside of the start timer region. 


Not all intrinsic functions are "thread hot", don't assume they are.  Particularly random number generators.  Our classical algorithm depends on 2 seed value which are updated on each call.  Thus if you start with the same seeds you can get the same sequence .  This is needed for reproducibility requirements. 



0 Kudos