Solved: Matrix transposition cannot reach 100% CPU usage even with /O3 optimization turned on

nvh10 · ‎07-13-2021

Hello everyone,

I want to transpose a 60720x60720 matrix.

In the project properties, I selected Maximize Speed plus Higher Level Optimizations (/O3) and Yes (/Qparallel). I also chose the Release x64 configuration.

But my program isn't optimized. Please kindly help me!

 program Console4

    implicit none
    double precision,allocatable,dimension(:,:):: P,PP
    integer(kind=8) :: tclock1, tclock2, clock_rate
    real(kind=8) :: elapsed_time
    integer nstate
    nstate=60720
    allocate(P(nstate,nstate),PP(nstate,nstate))
    PP=1d0
    write(*,*) "1"
    call system_clock(tclock1)
    P=TRANSPOSE(PP)
    call system_clock(tclock2, clock_rate)
    elapsed_time = real(tclock2 - tclock1, kind=8) / real(clock_rate, kind=8)
    write(*,*) elapsed_time
    end program Console4

Arjen_Markus · ‎07-13-2021

I am not sure it is possible to claim 100% of all CPUs. When I ran your program as is (well, after adding calls to cpu_time) on my Windows laptop, I got roughly 100% of one processor, that is 12% of the total (4 physical processors and hyperthreading enabled). I modified the program as attached to use OpenMP instead. The timings were:

Your program: 505 seconds wall clock, 475 seconds CPU.
My OpenMP program: 144 seconds wall clock, 627 seconds CPU, 60% of the total CPU, so roughly three times faster

I have attached the program. It is not optimised in any way other than using explicit loops and OpenMP. Transposing a very large matrix in this way has a very awkward memory access pattern, so there are very probably much more efficient ways to do it.
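The attachment is not reproduced in this thread, so the following is only a minimal sketch of the two ideas mentioned above, not the attached program. It assumes a small illustrative matrix size n and a tile edge bs that would need tuning. The names transpose_loops, transpose_tiled and is_transpose_of are hypothetical. The first routine uses plain explicit loops with an OpenMP directive. The second adds tiling, one common way to soften the awkward memory access pattern. OpenMP must be enabled at compile time (for example /Qopenmp or -qopenmp with Intel Fortran, -fopenmp with gfortran); without it the directives are treated as comments and the code runs serially.

```fortran
program transpose_omp_sketch
    implicit none
    integer, parameter :: dp = kind(1.0d0)
    integer, parameter :: n  = 1000   ! assumed small size, not the 60720 of the post
    integer, parameter :: bs = 64     ! assumed tile edge; n is deliberately not a multiple
    real(dp), allocatable :: a(:,:), b(:,:), c(:,:)
    integer :: i, j

    allocate(a(n,n), b(n,n), c(n,n))
    do j = 1, n
        do i = 1, n
            a(i,j) = real(i, dp) + 1.0d-4 * real(j, dp)   ! distinct values per element
        end do
    end do

    call transpose_loops(a, b)
    call transpose_tiled(a, c, bs)

    if (.not. is_transpose_of(a, b)) error stop "loop version is wrong"
    if (.not. is_transpose_of(a, c)) error stop "tiled version is wrong"
    write(*,*) "transpose check passed"

contains

    ! Explicit loops; the outer loop is shared among the OpenMP threads.
    subroutine transpose_loops(src, dst)
        real(dp), intent(in)  :: src(:,:)
        real(dp), intent(out) :: dst(:,:)
        integer :: i, j
        !$omp parallel do private(i)
        do j = 1, size(dst, 2)
            do i = 1, size(dst, 1)
                dst(i,j) = src(j,i)
            end do
        end do
        !$omp end parallel do
    end subroutine transpose_loops

    ! Same result, but processed tile by tile so that both the reads and
    ! the writes stay within a small block of memory at a time.
    subroutine transpose_tiled(src, dst, bs)
        real(dp), intent(in)  :: src(:,:)
        real(dp), intent(out) :: dst(:,:)
        integer, intent(in)   :: bs
        integer :: i, j, ii, jj
        !$omp parallel do collapse(2) private(i, j)
        do jj = 1, size(dst, 2), bs
            do ii = 1, size(dst, 1), bs
                do j = jj, min(jj + bs - 1, size(dst, 2))
                    do i = ii, min(ii + bs - 1, size(dst, 1))
                        dst(i,j) = src(j,i)
                    end do
                end do
            end do
        end do
        !$omp end parallel do
    end subroutine transpose_tiled

    ! Element-by-element check that dst(i,j) equals src(j,i).
    logical function is_transpose_of(src, dst) result(ok)
        real(dp), intent(in) :: src(:,:), dst(:,:)
        integer :: i, j
        ok = .true.
        do j = 1, size(dst, 2)
            do i = 1, size(dst, 1)
                if (dst(i,j) /= src(j,i)) ok = .false.
            end do
        end do
    end function is_transpose_of

end program transpose_omp_sketch
```

Whether the tiled variant is actually faster, and for which value of bs, depends on the cache sizes of the machine, so it should be measured rather than assumed.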

nvh10 · ‎07-13-2021

Thank you very much!

Can you kindly explain to me what the difference is between "elapsed time" and "cpu time" in your program?

Arjen_Markus · ‎07-14-2021

The routine system_clock gives you the wall clock time, that is, the time as experienced by the user. In many respects this is the most important metric. I mean: can you wait for the calculation to finish, or should you get a cup of coffee, have lunch or sleep over it? The routine cpu_time returns the amount of time the CPUs in the machine have spent on doing your calculation. If you have a multithreaded/parallel program, that time is likely to be longer than the wall clock time, which indicates that you have sped up the program. Thus the ratio between the two is a measure of the success you have had in parallelising the program.
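As an illustration of the two timers, here is a minimal sketch that brackets the same transposition with both system_clock and cpu_time and prints their ratio. The matrix size n is an assumed small value and the variable names are illustrative only. For a serial run the ratio should be close to 1. For a run that really uses several threads it would be expected to exceed 1, provided the compiler's cpu_time sums the time over all threads.

```fortran
program wall_vs_cpu
    implicit none
    integer, parameter :: dp = kind(1.0d0)
    integer, parameter :: i8 = selected_int_kind(18)
    integer, parameter :: n  = 2000          ! assumed small size for a quick run
    real(dp), allocatable :: a(:,:), b(:,:)
    integer(i8) :: count0, count1, count_rate
    real(dp)    :: cpu0, cpu1, wall, cpu

    allocate(a(n,n), b(n,n))
    a = 1.0_dp

    call system_clock(count0, count_rate)    ! wall clock ticks and ticks per second
    call cpu_time(cpu0)                      ! processor time in seconds

    b = transpose(a)

    call cpu_time(cpu1)
    call system_clock(count1)

    wall = real(count1 - count0, dp) / real(count_rate, dp)
    cpu  = cpu1 - cpu0

    write(*,'(a,f10.4)') "wall clock [s]: ", wall
    write(*,'(a,f10.4)') "CPU time   [s]: ", cpu
    if (wall > 0.0_dp) write(*,'(a,f10.4)') "CPU / wall    : ", cpu / wall
    write(*,*) "b(1,1) =", b(1,1)            ! use the result so the work is not skipped
end program wall_vs_cpu
```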

Bernard · ‎07-13-2021

As a side note, there is no **hardware notion of 100% CPU usage. Windows Task Manager probably relies on time-based sampling of the instruction pointer (the RIP register) while your matrix-transposition program runs, probably driven by the timer interrupt or a software interrupt. The timer interval is usually 1 millisecond, hence about 1000 samples are collected per second. In a simplified view, a postprocessing module then counts how many of those samples, i.e. instruction pointer addresses, fall into the address space of your program and reports that count as a percentage of the 1000 samples.

For more accurate timing measurements you should rely on the RDTSCP instruction or the CPU_CLK_UNHALTED.THREAD fixed performance event (manual instrumentation). Alternatively, you may profile your program with perf stat (counting mode) on Linux or with VTune (Windows, Linux) working in counting mode.

**Unless there are metrics that compute the usage and/or occupancy of the functional units over a specific interval based on fixed reference cycles, as a function of some basic block of the code, e.g. a for-loop or a whole function body.

nvh10 · ‎07-13-2021

Thank you for your comment!

Bernard · ‎07-13-2021

You are welcome!!

Bernard · ‎07-14-2021

One issue that I forgot to mention is the potential skew of timing measurements for such a lengthy procedure or function. Your main thread probably runs at "Normal Priority", i.e. level 8 (IIRC), which means there is a high probability of it being swapped out by the Windows scheduler at seemingly random times. The timing subroutines are by no means bound to a specific thread, so they will also measure the execution time of a "foreign" thread that took possession of the core executing your main thread.

In addition to the above, there are various power and thermal events related to CPU operation that may contaminate the measurement results. Lastly, you should not take a single measurement, because of the large variance between consecutive runs of the program. The observed distribution is rarely normal and is often multimodal with a skewed right tail.
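One simple way to act on this advice is to repeat the measurement and report order statistics instead of a single number. The sketch below is only an illustration under assumed values: the matrix size n and the number of repetitions nruns are arbitrary, and the minimum, median and maximum are used because they do not presuppose a normal distribution.

```fortran
program repeat_timing
    implicit none
    integer, parameter :: dp = kind(1.0d0)
    integer, parameter :: i8 = selected_int_kind(18)
    integer, parameter :: n     = 1000   ! assumed small matrix size
    integer, parameter :: nruns = 11     ! assumed number of repetitions (odd)
    real(dp), allocatable :: a(:,:), b(:,:)
    real(dp)    :: t(nruns), tmp
    integer(i8) :: c0, c1, rate
    integer     :: k, m

    allocate(a(n,n), b(n,n))
    call random_number(a)
    call system_clock(count_rate=rate)

    ! Time the same operation several times.
    do k = 1, nruns
        call system_clock(c0)
        b = transpose(a)
        call system_clock(c1)
        t(k) = real(c1 - c0, dp) / real(rate, dp)
    end do

    ! Insertion sort of the timings in ascending order.
    do k = 2, nruns
        tmp = t(k)
        m = k - 1
        do while (m >= 1)
            if (t(m) <= tmp) exit
            t(m+1) = t(m)
            m = m - 1
        end do
        t(m+1) = tmp
    end do

    write(*,'(a,es12.4)') "min    [s]: ", t(1)
    write(*,'(a,es12.4)') "median [s]: ", t((nruns + 1) / 2)
    write(*,'(a,es12.4)') "max    [s]: ", t(nruns)
    write(*,*) "checksum:", sum(b)       ! use the result so the work is not skipped
end program repeat_timing
```

A large gap between the minimum and the median, or between the median and the maximum, is a hint that some runs were disturbed by scheduling, power or thermal effects.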