Hello.
I have a multithreaded (OpenMP) Fortran code, compiled with ifort, that runs 24/7 doing
fluid flow calculations. It uses about 6 GB of RAM and is run as a series of
self-submitting jobs. The processor is an i7-3930K with 16 GB of DDR3 @ 1600 MHz,
the OS is openSUSE, and the jobs are run from a Konsole window. No other application
runs on this system and it is not connected to the net.
Each job in the series performs a virtually identical number of operations -
except when there is an occasional database dump - so in theory the CPU time used
should be about constant for each segment of the run.
The curious thing is that the code runs fastest after a reboot, and
over the subsequent runs the CPU time used creeps up. Over 2 days the CPU time
has increased by about 6%. Isolated runs have occurred that were ~20% slower than
average.
Is there any explanation for this CPU-time creep?
thanks
--
PS: timing is measured with omp_get_wtime(). The Linux time utility is also used
to time the executable.
>>is run as a series of self-submitting jobs.
It could be any number of things.
Can you determine if the later runs (submissions) are incurring additional page faults?
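On Linux, a running job's fault counters can be sampled without restarting it; a minimal sketch (field positions per proc(5); this shell's own PID, $$, stands in for the job's PID):

```shell
# Read the minor/major page-fault counters of a live process from /proc.
pid=$$
stat=$(cat /proc/$pid/stat)
rest=${stat##*) }     # drop "pid (comm) " so spaces in comm can't shift fields
set -- $rest          # now $8 = minflt and ${10} = majflt (proc(5) fields 10 and 12)
echo "minflt=$8 majflt=${10}"
```

Sampling this periodically for the a.out PID would show whether later segments really fault more.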
The i7-3930K has Turbo Boost, or, stated conversely: it slows down when hot. The max turbo frequency is 3.8 GHz; the base frequency is 3.2 GHz (~15% slower).
If the application is writing log file(s), the file size may be affecting performance.
Jim Dempsey
Thanks for the suggestions. Because the whole programme fits very comfortably
into main memory, I had not been following the page-fault count. Now I have.
Thermal throttling is not an issue. The processor is water-cooled,
the side panels of the box are always off, and I monitor the temperature all the
time. The max temperature at present is about 72 C.
File writing is not a problem either. A run writes some simple stats and values
from a few monitor points every 15 minutes. The final file is ~30 MB, most of
it written at startup. The 4.1 GB unformatted dump/read at the end/beginning of
each run is not counted when the per-step CPU time is calculated with omp_get_wtime().
Finally, I used the /usr/bin/time utility to follow the page faults. So far it has
reported 0 swaps. The number of *major* page faults varies between 0 and
28 over 4-hour segments of elapsed time, while there are gazillions of *minor*
page faults. But none of these correlates with the time-step variation.
--
>>is run as a series of self-submitting jobs.
Is this performed via a shell script?
Or...
FortranDoLoop: do j = 1, nJobs
  call DoJob(JobList(j))
end do FortranDoLoop
Or...
FortranFirstJob:
...
if (AnotherJob) ok = SYSTEMQQ(NextJob)
where NextJob can be a copy of the first job with different arguments, itself ending with a SYSTEMQQ call that launches the next job.
The last case might be problematic as it creates a job-count series of nested processes.
If you are using either of the first two techniques, then the slowdown may indicate that Linux is running a background process.
Have you run "top" in a second console window?
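A one-shot alternative to an interactive top, as a sketch (assumes the procps ps options are available):

```shell
# Snapshot the ten busiest processes by CPU share, to spot any
# background task competing with a.out for cores.
ps -eo pid,pcpu,pmem,etime,comm --sort=-pcpu | head -n 10
```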
Jim Dempsey
J,
I am using a bash script to resubmit the executable.
In a file called (say) job there are statements like
#!/bin/bash
ulimit -s 8000000
export OMP_NUM_THREADS=6
cp fort.7 fort.8
/usr/bin/time ./a.out > OUT
if [ $? -ne 0 ]; then
    mv job jjob
    exit
fi
./job &
exit
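For comparison, Jim's single-driver-loop suggestion could replace the self-resubmission; a sketch, where RUN_CMD is a hypothetical stand-in for the cp/time/a.out sequence above:

```shell
#!/bin/bash
# One shell drives the whole series, so no chain of resubmitted
# scripts is created across segments.
RUN_CMD=${RUN_CMD:-true}   # placeholder for: cp fort.7 fort.8 && /usr/bin/time ./a.out > OUT
export OMP_NUM_THREADS=6
for run in 1 2 3; do
    if ! $RUN_CMD; then
        echo "segment $run failed, stopping"
        break
    fi
    echo "segment $run done"
done
```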
Anyway, Jim, you have been very helpful as always. This one is not worth the candle, though.
I am at the moment trying to co-opt two GPU cards using OpenCL to accelerate things.
The potential for floating-point acceleration there is substantial, except for the slow PCIe link.
--
A,
Have you considered Xeon Phi? There are some available on eBay for a reasonable price, and a whole bunch will become available shortly as Knights Landing goes into production; existing users upgrading will need to find a secondary market. The only downsides: your motherboard has to support PCIe x16 devices with an address window larger than 4 GB (some older motherboards do not have a BIOS setting for this), and you also need to supply cooling. A "passively" cooled Xeon Phi ships without a fan; you must supply the fans.
Jim
Thanks for the suggestion. I had never considered the Phi, knowing it to be expensive.
I checked my X79 motherboard, and it does not allow PCIe addressing larger than 4 GB. Even
if the older generation of Phis becomes very cheap, the main worry is the Intel compiler
pricing, which could be an order of magnitude more expensive than a decent GPU.
I tested two GPUs, and the only bottleneck is the data transfer rate over PCIe 2 (I am
testing on an old system so far), while their double-precision computing speeds are
impressive. But the GPUs are cheap and support PCIe 3, and you can put two or three of
them in the box and try to choreograph the data transfers between them even if their
computing capacity is underused.
I would be interested to know the Phi results (whatever model) for the following, if at all possible:
- elapsed time for a double-precision, real-to-complex, in-place FFT of a batch of 50000 vectors, each 1536 (real) elements long;
- elapsed time of the inverse transform of the above;
- elapsed time to send the 50000 x 1538 x 8 bytes to the Phi and get them back.
--
I cannot estimate the FFT, but for large transfers ~5 GB/s (out of 6) seems doable: ~0.123 s per direction.
If the batches are truly separate, then you would want to compute batch x while transferring in batch x+1 and transferring out batch x-1.
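The ~0.123 s figure can be reproduced from the batch size quoted earlier (50000 x 1538 reals x 8 bytes) at the assumed ~5 GB/s:

```shell
# 50000 vectors x 1538 reals x 8 bytes each, pushed at ~5e9 bytes/s
awk 'BEGIN { bytes = 50000 * 1538 * 8; printf "%.3f s per direction\n", bytes / 5e9 }'
# prints: 0.123 s per direction
```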
Look at: http://www.colfax-intl.com/nd/ for white papers.
Jim Dempsey
Near-asymptotic transfer rates are not very informative. For the FFTs I asked
about earlier, the ratio of the elapsed time for the one-way data transfer (PCIe 2)
to that of the kernel execution on an R9 280X is more than 19. So there is no advantage
in overlapping transfers with calculations in this case; possibly the contrary.
Thank you for your time.
--