Hello.
I have a multithreaded (OpenMP) Fortran code, compiled with ifort, that runs 24/7 doing
fluid flow calculations. It uses about 6 GB of RAM and is run as a series of
self-submitting jobs. The processor is an i7-3930K with 16 GB of DDR3 @ 1600 MHz,
the OS is openSUSE, and the jobs are run from a Konsole window. No other application
runs on this system and it is not connected to the net.
Each job in the series performs a virtually identical number of operations -
except when there is an occasional database dump - so in theory the CPU time used
should be about constant for each segment of the run.
The curious thing is that the code runs fastest after a reboot, and
over the subsequent runs the CPU time used creeps up. Over 2 days the CPU time
has increased by about 6%. Isolated runs have occurred that were ~20% slower than
average.
Is there any explanation for this CPU-time creep?
thanks
--
PS: timing is measured with omp_get_wtime(). The Linux time utility is also used
to time the executable.
>>is run as a series of self-submitting jobs.
It could be any number of things.
Can you determine if the later runs (submissions) are incurring additional page faults?
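On Linux, a running job's fault counters can be sampled without restarting it; a minimal sketch (field positions per proc(5); this shell's own PID, $$, stands in for the job's PID):

```shell
# Read the minor/major page-fault counters of a live process from /proc.
pid=$$
stat=$(cat /proc/$pid/stat)
rest=${stat##*) }     # drop "pid (comm) " so spaces in comm can't shift fields
set -- $rest          # now $8 = minflt and ${10} = majflt (proc(5) fields 10 and 12)
echo "minflt=$8 majflt=${10}"
```

Sampling this periodically for the a.out PID would show whether later segments really fault more.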
The i7-3930K has Turbo Boost, or, stated conversely: it slows down when hot. The max turbo frequency is 3.8 GHz; the base frequency is 3.2 GHz (~15% slower).
If the application is writing log file(s), the file size may be affecting performance.
Jim Dempsey
Thanks for the suggestions. Because the whole programme fits very comfortably
into main memory, I had not been following the page-fault count. Now I have.
Thermal throttling is not an issue. The processor is water-cooled,
the side panels of the box are always off, and I monitor the temperature all the
time. The max temperature at present is about 72 C.
File writing is not a problem either. A run writes some simple stats and values
from a few monitor points every 15 minutes. The final file is ~30 MB, most of
it written at startup. The 4.1 GB unformatted dump/read at the end/beginning of
each run is not counted when the per-step CPU time is calculated with omp_get_wtime().
Finally, I used the /usr/bin/time utility to follow the page faults. So far it has
reported 0 swaps. The number of *major* page faults varies between 0 and
28 over 4-hour segments of elapsed time, while there are gazillions of *minor*
page faults. But none of these correlates with the time-step variation.
--
>>is run as a series of self-submitting jobs.
Is this performed via a shell script?
Or...
FortranDoLoop: do j = 1, nJobs
  call DoJob(JobList(j))
end do FortranDoLoop
Or...
FortranFirstJob:
...
if (AnotherJob) ok = SYSTEMQQ(NextJob)
where NextJob can be a copy of the first job with different arguments, itself ending with a SYSTEMQQ call that launches the next job.
The last case might be problematic as it creates a job-count series of nested processes.
If you are using either of the first two techniques, then the slowdown may indicate that Linux is running a background process.
Have you run "top" in a second console window?
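A one-shot alternative to an interactive top, as a sketch (assumes the procps ps options are available):

```shell
# Snapshot the ten busiest processes by CPU share, to spot any
# background task competing with a.out for cores.
ps -eo pid,pcpu,pmem,etime,comm --sort=-pcpu | head -n 10
```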
Jim Dempsey
J,
I am using a bash script to resubmit the executable.
In a file called (say) job there are statements like
#!/bin/bash
ulimit -s 8000000
export OMP_NUM_THREADS=6
cp fort.7 fort.8
/usr/bin/time ./a.out > OUT
if [ $? -ne 0 ]; then
    mv job jjob
    exit
fi
./job &
exit
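For comparison, Jim's single-driver-loop suggestion could replace the self-resubmission; a sketch, where RUN_CMD is a hypothetical stand-in for the cp/time/a.out sequence above:

```shell
#!/bin/bash
# One shell drives the whole series, so no chain of resubmitted
# scripts is created across segments.
RUN_CMD=${RUN_CMD:-true}   # placeholder for: cp fort.7 fort.8 && /usr/bin/time ./a.out > OUT
export OMP_NUM_THREADS=6
for run in 1 2 3; do
    if ! $RUN_CMD; then
        echo "segment $run failed, stopping"
        break
    fi
    echo "segment $run done"
done
```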
Anyway, Jim, you have been very helpful as always. This one is not worth the candle, though.
I am at the moment trying to co-opt two GPU cards using OpenCL to accelerate things.
The potential for floating-point acceleration there is substantial, except for the slow PCIe link.
--
A,
Have you considered Xeon Phi? There are some available on eBay for a reasonable price, and a whole bunch will become available shortly as Knights Landing goes into production; existing users upgrading will need to find a secondary market. The only downsides: your motherboard has to support PCIe x16 devices with an address window larger than 4 GB (some older motherboards do not have a BIOS setting for this), and you also need to supply cooling. A "passively" cooled Xeon Phi ships without a fan; you must supply the fans.
Jim
Thanks for the suggestion. I had never considered the Phi, knowing it to be expensive.
I checked my X79 motherboard, and it does not allow PCIe addressing larger than 4 GB. Even
if the older generation of Phis becomes very cheap, the main worry is the Intel compiler
pricing, which could be an order of magnitude more expensive than a decent GPU.
I tested two GPUs, and the only bottleneck is the data transfer rate over PCIe 2 (I am
testing on an old system so far), while their double-precision computing speeds are
impressive. But the GPUs are cheap and support PCIe 3, and you can put two or three of
them in the box and try to choreograph the data transfers between them even if their
computing capacity is underused.
I would be interested to know the Phi results (whatever model) for the following, if at all possible:
- elapsed time for a double-precision, real-to-complex, in-place FFT of a batch of 50000 vectors, each 1536 (real) elements long;
- elapsed time of the inverse transform of the above;
- elapsed time to send the 50000 x 1538 x 8 bytes to the Phi and get them back.
--
I cannot estimate the FFT, but for large transfers ~5 GB/s (out of 6) seems doable: ~0.123 s per direction.
If the batches are truly separate, then you would want to compute batch x while transferring in batch x+1 and transferring out batch x-1.
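The ~0.123 s figure can be reproduced from the batch size quoted earlier (50000 x 1538 reals x 8 bytes) at the assumed ~5 GB/s:

```shell
# 50000 vectors x 1538 reals x 8 bytes each, pushed at ~5e9 bytes/s
awk 'BEGIN { bytes = 50000 * 1538 * 8; printf "%.3f s per direction\n", bytes / 5e9 }'
# prints: 0.123 s per direction
```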
Look at: http://www.colfax-intl.com/nd/ for white papers.
Jim Dempsey
Near-asymptotic transfer rates are not very informative. For the FFTs I asked
about earlier, the ratio of the elapsed time for the one-way data transfer (PCIe 2)
to that of the kernel execution on an R9 280X is more than 19. So there is no advantage
in overlapping transfers with calculations in this case; possibly the contrary.
Thank you for your time.
--