Hello everyone,
I want to transpose a 60720x60720 matrix.
In the project properties, I turned on Maximize Speed plus Higher Level Optimizations (/O3) and Yes (/Qparallel). I also chose the Release x64 configuration.
But my program isn't optimized. Please kindly help me!
program Console4
implicit none
double precision,allocatable,dimension(:,:):: P,PP
integer(kind=8) :: tclock1, tclock2, clock_rate
real(kind=8) :: elapsed_time
integer nstate
nstate=60720
allocate(P(nstate,nstate),PP(nstate,nstate))
PP=1d0
write(*,*) "1"
call system_clock(tclock1)
P=TRANSPOSE(PP)
call system_clock(tclock2, clock_rate)
elapsed_time = float(tclock2 - tclock1) / float(clock_rate)
write(*,*) elapsed_time
end program Console4
I am not sure it is possible to claim 100% of all CPUs. When I ran your program as is (well, after adding calls to cpu_time) on my Windows laptop, I got roughly 100% of one processor, that is 12% of the total (4 physical processors and hyperthreading enabled). I modified the program as attached to use OpenMP instead. The timings were:
- Your program: 505 seconds wall clock, 475 seconds CPU.
- My OpenMP program: 144 seconds wall clock, 627 seconds CPU, 60% of the total CPU, so roughly three times faster
I have attached the program. It is not optimised in any way other than using explicit loops and OpenMP. Transposing a very large matrix in this way has a very awkward memory access pattern: one of the two arrays is always traversed with a large stride. There are very probably much more efficient ways to do so.
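The attachment is not reproduced here. As a rough illustration of what "explicit loops and OpenMP" with a friendlier memory access pattern could look like, here is a minimal sketch of a tiled (cache-blocked) transposition. It is not the attached program. The names transpose_blocked, n and bs are my own, the matrix is kept small so the check runs quickly, and the tile size of 64 is only a guess that would need tuning on a real machine.

```fortran
program transpose_blocked
    ! Sketch only: tiled transposition with OpenMP.
    ! n and bs are illustrative assumptions, not values from the thread.
    implicit none
    integer, parameter :: dp = kind(1.0d0)
    integer, parameter :: n  = 1000   ! demo size (the thread uses 60720)
    integer, parameter :: bs = 64     ! tile edge, a tuning assumption
    real(dp), allocatable :: a(:,:), b(:,:)
    integer :: i, j, ib, jb
    logical :: ok

    allocate(a(n,n), b(n,n))
    call random_number(a)

    ! The two outer loops walk over tiles and are shared among threads.
    ! Inside a tile, a and b are both accessed within a bs x bs window,
    ! so the strided accesses stay within a small region of memory.
    !$omp parallel do collapse(2) private(i, j) schedule(static)
    do jb = 1, n, bs
        do ib = 1, n, bs
            do j = jb, min(jb + bs - 1, n)
                do i = ib, min(ib + bs - 1, n)
                    b(i,j) = a(j,i)
                end do
            end do
        end do
    end do
    !$omp end parallel do

    ! Verify element by element (avoids a large temporary array).
    ok = .true.
    do j = 1, n
        do i = 1, n
            if (b(i,j) /= a(j,i)) ok = .false.
        end do
    end do
    if (.not. ok) error stop "blocked transpose mismatch"
    write(*,*) "blocked transpose OK"
end program transpose_blocked
```

OpenMP has to be enabled at compile time, for instance with /Qopenmp for Intel Fortran on Windows or -fopenmp for gfortran. Without it the directives are treated as comments and the loops simply run serially. Whether tiling actually helps for a 60720x60720 matrix would have to be measured.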
Thank you very much!
Can you kindly explain to me what the difference is between "elapsed time" and "cpu time" in your program?
The routine system_clock returns the time as experienced by the user, which is in many respects the most important metric. I mean: can you wait for the calculation to finish, or should you get a cup of coffee, have lunch or sleep on it? The routine cpu_time returns the amount of time the CPUs in the machine have spent on your calculation. If you have a multithreaded/parallel program, that time is likely to be longer than the wall clock time, indicating you have sped up the program. Thus the ratio between the two is a measure of the success you have had in parallelising the program.
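To make the difference concrete, here is a minimal sketch that measures the same piece of work with both routines and prints the ratio. The program name, the variable names and the matrix size are my own choices for illustration. Applied to the timings quoted earlier in this thread, the ratio would be about 475/505 ≈ 0.94 for the original program and 627/144 ≈ 4.4 for the OpenMP version.

```fortran
program wall_vs_cpu
    ! Sketch only: compare wall clock time (system_clock) with
    ! CPU time (cpu_time) for the same piece of work.
    implicit none
    integer, parameter :: dp = kind(1.0d0)
    integer, parameter :: i8 = selected_int_kind(18)
    integer, parameter :: n  = 1000   ! illustrative size, not from the thread
    real(dp), allocatable :: a(:,:), b(:,:)
    integer(i8) :: c0, c1, rate
    real(dp) :: t0, t1, wall, cpu

    allocate(a(n,n), b(n,n))
    call random_number(a)

    call system_clock(c0, rate)   ! wall clock ticks and ticks per second
    call cpu_time(t0)             ! processor time in seconds

    b = transpose(a)

    call cpu_time(t1)
    call system_clock(c1)

    wall = real(c1 - c0, dp) / real(rate, dp)
    cpu  = t1 - t0

    write(*,'(a,f12.6,a)') "wall clock: ", wall, " s"
    write(*,'(a,f12.6,a)') "CPU time:   ", cpu,  " s"
    ! A ratio near 1 suggests one busy thread; a ratio well above 1
    ! suggests several threads were working at the same time.
    if (wall > 0.0_dp) write(*,'(a,f8.2)') "CPU/wall:   ", cpu / wall
    ! Use the result so the transposition cannot be optimised away.
    write(*,*) "check: ", b(1,n) == a(n,1)
end program wall_vs_cpu
```

For such a small matrix the measured times may be close to the clock resolution, so the ratio is only meaningful for work that runs for a noticeable amount of time.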
@nvh10 wrote:
Hello everyone,
I want to transpose a 60720x60720 matrix.
In the project properties, I turned on Maximize Speed plus Higher Level Optimizations (/O3) and Yes (/Qparallel). I also chose the Release x64 configuration.
But my program isn't optimized. Please kindly help me!
program Console4
implicit none
double precision,allocatable,dimension(:,:):: P,PP
integer(kind=8) :: tclock1, tclock2, clock_rate
real(kind=8) :: elapsed_time
integer nstate
nstate=60720
allocate(P(nstate,nstate),PP(nstate,nstate))
PP=1d0
write(*,*) "1"
call system_clock(tclock1)
P=TRANSPOSE(PP)
call system_clock(tclock2, clock_rate)
elapsed_time = float(tclock2 - tclock1) / float(clock_rate)
write(*,*) elapsed_time
end program Console4
As a side note, there is no **hardware notion of 100% CPU usage. Windows Task Manager probably uses time-based sampling of the instruction pointer (RIP register) of your matrix-transposition program, probably relying on the timer interrupt or a software interrupt. The timer interval is usually 1 millisecond, hence about 1000 samples will be collected per second. If most of those samples, i.e. IP addresses, fall within the address space of your program, then a post-processing module will (in a simplified view) report the fraction of the 1000 samples that belong to your program's address space as its CPU usage.
For more accurate timing measurements you should rely on the RDTSCP instruction or the CPU_CLK_UNHALTED.THREAD fixed performance event (manual instrumentation). Alternatively, you can profile your program with perf stat (counting mode) on Linux, or with VTune (Windows, Linux) working in counting mode.
**Unless there are metrics that compute the usage and/or occupancy of the functional units over an interval measured in fixed reference cycles, as a function of some block of the code, e.g. a loop or a whole function body.
One issue that I forgot to mention is the potential skew of timing measurements for such a long-running procedure or function. Your main thread probably runs at "Normal Priority", i.e. level 8 (IIRC), which means there is a high probability of it being preempted by the Windows scheduler at seemingly random times. The timing subroutines are by no means bound to a specific thread, so they will also measure the execution time of a "foreign" thread that took possession of the core executing your main thread.
In addition to the above, there are various power and thermal events related to CPU operation that may contaminate the measurement results. Lastly, you should not rely on a single measurement, because of the large variance between consecutive runs of the program. The observed distribution is rarely normal and is often multimodal with a skewed right tail.
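As a rough sketch of taking several measurements instead of one, the program below repeats the transposition a number of times and reports the minimum, median and maximum wall clock time. Everything in it is illustrative: the names, the matrix size and the repetition count nrep are my own assumptions, and the untimed warm-up pass is a design choice of this sketch rather than something prescribed above. Minimum and median are reported because they are less sensitive to a skewed right tail than the mean.

```fortran
program repeat_timing
    ! Sketch only: time the same operation several times and report
    ! order statistics instead of a single number.
    implicit none
    integer, parameter :: dp = kind(1.0d0)
    integer, parameter :: i8 = selected_int_kind(18)
    integer, parameter :: n    = 1000   ! illustrative size
    integer, parameter :: nrep = 11     ! odd, so the median is one sample
    real(dp), allocatable :: a(:,:), b(:,:)
    real(dp) :: t(nrep), tmp
    integer(i8) :: c0, c1, rate
    integer :: k, m

    allocate(a(n,n), b(n,n))
    call random_number(a)
    b = transpose(a)    ! warm-up pass, not timed

    do k = 1, nrep
        call system_clock(c0, rate)
        b = transpose(a)
        call system_clock(c1)
        t(k) = real(c1 - c0, dp) / real(rate, dp)
    end do

    ! Insertion sort of the samples, smallest first.
    do k = 2, nrep
        tmp = t(k)
        m = k - 1
        do while (m >= 1)
            if (t(m) <= tmp) exit
            t(m+1) = t(m)
            m = m - 1
        end do
        t(m+1) = tmp
    end do

    write(*,'(a,es12.4,a)') "min:    ", t(1), " s"
    write(*,'(a,es12.4,a)') "median: ", t((nrep + 1) / 2), " s"
    write(*,'(a,es12.4,a)') "max:    ", t(nrep), " s"
    ! Use the result so the repeated work cannot be optimised away.
    write(*,*) "check: ", b(1,n) == a(n,1)
end program repeat_timing
```

A large gap between the minimum and the maximum is a hint that the runs were disturbed by scheduling, power or thermal effects of the kind described above.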