Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

Large Arrays: using "allocatable" versus static declaration

etmeyer
Beginner
1,616 Views
I am just a scientific programmer, so I hope someone out there with CS knowledge can educate me on a problem I am having.

I have a 1000 x 1000 x 1000 array (for this test - in other cases it may need to be different)

If I use this code:

program test
IMPLICIT NONE
DOUBLE PRECISION, allocatable :: stupid(:,:,:)
DOUBLE PRECISION, allocatable :: stupid2(:,:,:)
ALLOCATE(stupid(1000,1000,1000),stupid2(1000,1000,1000))
stupid = 0
print*, "shape is ", SHAPE(stupid)
end program test

The program hangs and bogs down my whole system (a new 64-bit Dell with plenty of memory).

But if I do the same thing, with static declarations, it's super fast!

program test
IMPLICIT NONE
DOUBLE PRECISION :: stupid(1000,1000,1000)
stupid = 0
print*, "shape is ", SHAPE(stupid)
end program test

Why is that???

I'm sure I've used large allocatable arrays before, though only 2 dimensions.

I'm probably not understanding some difference in the amount of memory needed for one or the other. By my naive calculation, a 16-bit double, if the array has 1e9 elements (that's 1000^3), would take up 2 GB of space. That's a lot, but not more than my system can handle (I have 4 GB of memory) - and it still doesn't explain why the static version is so fast and has no problems??
8 Replies
Ron_Green
Moderator
Quoting - etmeyer

DOUBLE PRECISION values use 8 bytes each, not 16 bits.

And in the first example you have two arrays, whereas in the second you have just one.

Make sure your OS is a 64-bit version:

uname -a
Linux spdr65 2.6.16.60-0.21-smp #1 SMP Tue May 6 12:41:02 UTC 2008 x86_64 x86_64 x86_64 GNU/Linux

Look for "x86_64" for a 64-bit OS.

Compile with:

-mcmodel=medium -shared-intel

To monitor your memory usage, open another window on the system and run:

vmstat 2

watching the "free" column and the swap columns ("swpd", "si", "so").

ron

etmeyer
Beginner
Thank you for the correction. Either way, the total memory allocated (theoretically) should not be the problem unless there is a drastic difference between static and allocatable that I'm not appreciating.

I installed 64 bit ubuntu 9.10 on this computer myself - but here you are:

[0802][meyer@inspiron17]$ uname -a
Linux inspiron17 2.6.31-15-generic #50-Ubuntu SMP Tue Nov 10 14:53:52 UTC 2009 x86_64 GNU/Linux

I have 4 gigs of memory.

I did the vmstat test for the allocatable example. The static example runs so fast I can't start vmstat after a.out, though I could start it beforehand.

output:
[1044][meyer@inspiron17]$ vmstat 2
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
2 14 359272 26064 1560 339024 6 25 78 217 432 458 10 6 82 2
0 14 406448 26552 1404 337480 900 23666 3516 23666 1764 1943 7 6 14 72
4 9 452524 25404 344 328284 838 23140 2304 23140 2049 1962 8 12 31 49
0 20 482300 26632 332 321184 1524 15044 2768 15044 2158 1933 11 7 0 82
0 17 500800 26044 332 317976 1570 9450 2634 9450 1925 1750 4 7 0 88
0 15 517424 26000 712 318500 1136 8532 2408 8532 2112 1917 9 9 0 81
0 17 551240 25832 724 308024 792 17058 1304 17066 3851 1650 4 15 0 81
1 20 600468 26192 856 303600 1476 24836 2716 24948 2458 1754 3 10 0 88
0 15 607724 26016 856 305840 2046 3864 3428 3864 1966 1673 4 8 0 89
0 14 644244 26204 812 303604 1536 18654 2454 18700 3794 1771 3 13 0 84
0 13 700480 25232 604 302780 1026 28306 2162 28546 2239 1644 3 13 0 84
0 13 731684 26236 596 298632 928 15766 1996 15862 2897 2121 3 11 0 86
1 13 840248 25824 716 265092 204 54316 466 54328 5541 1839 3 24 22 51
0 17 876140 30544 712 253800 1596 18092 2492 18128 2473 2960 6 12 0 82
1 14 918636 25724 992 247460 1212 21532 1962 21588 2762 3414 8 16 2 74
0 9 993676 26612 884 235944 650 37580 986 37580 2786 2563 6 15 16 64
0 8 1056992 26028 884 199512 524 31720 1006 31720 3558 3178 7 16 25 52
2 9 1128396 26276 876 194960 636 35816 790 35838 3171 2923 7 15 31 47
0 12 1180856 27540 876 185420 1326 26440 1374 26454 2957 2327 5 12 1 83
2 12 1233432 26552 776 168260 1010 26404 2250 26438 2542 3412 7 15 0 78
1 9 1259012 25848 712 160884 1310 12920 2498 12920 2377 2577 4 10 0 86
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
1 10 1290292 26088 876 149932 1284 15736 2152 15736 2366 2918 6 11 0 83
0 9 1396524 26496 736 145704 262 53164 446 53288 5344 3064 8 21 16 54
0 13 1436492 26872 732 147736 1606 20070 2044 20070 2674 2757 5 11 7 76
0 14 1457844 25108 712 147028 1806 10802 2574 10802 1908 2058 4 10 0 86
0 14 1465196 26428 684 145140 1648 3750 3368 3750 1944 2435 5 8 0 87
0 13 1475860 40496 336 147772 2078 5482 3244 5564 2045 2817 5 11 7 77
0 9 1476396 43908 720 152168 1648 388 3074 388 1667 3209 8 8 35 50  <-- killed the test here
0 6 792224 3501452 720 155172 1898 0 3972 0 1522 3580 6 16 37 41
0 13 791636 3496500 896 154900 1838 0 3410 42 1477 3352 8 7 32 54
0 9 791044 3489060 1300 158388 1898 0 3906 0 1501 2573 4 5 34 57
0 7 790252 3479468 1308 160988 2020 0 3380 0 1770 2392 6 6 39 50
1 8 789084 3462340 1460 176572 2446 0 2808 20 2121 2739 6 8 23 63
0 6 787800 3461704 1460 171328 2868 0 3560 0 2173 3582 6 7 39 48
0 6 786824 3454608 1464 174560 2118 0 3556 0 1972 2215 3 7 39 51
2 2 784776 3448552 1480 175536 2376 0 2886 30 2244 4216 8 6 36 50
^C
[1047][meyer@inspiron17]$

etmeyer
Beginner
I did the vmstat for the static case as well, just started it before the execution:


[1047][meyer@inspiron17]$ vmstat 2
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
1 0 606068 3117092 29384 269668 14 52 89 240 444 485 10 6 82 2
5 0 605172 3117844 29384 268100 372 0 372 0 1726 4973 29 7 63 1
0 0 605168 3107284 29392 277872 0 0 0 24 2042 4152 12 8 79 0
0 0 605164 3111064 29400 273528 2 0 2 12 1936 4046 9 7 82 1
2 0 605164 3113236 29400 271412 6 0 6 0 1737 3674 9 5 86 0
0 1 605104 3098116 29556 285388 34 0 5866 0 1858 3651 9 7 62 22 <-- approximate start
2 0 605100 3084252 29576 299360 0 0 8244 38 1821 3738 11 7 47 36
0 0 605100 3079400 29576 304628 28 0 364 0 1754 4245 9 10 79 2 <-- approximate end
0 0 605100 3077964 29576 306172 0 0 0 2 1582 3300 8 5 87 0
0 0 605096 3081140 29584 301980 0 0 0 18 1294 3280 7 7 86 0
3 0 605096 3086752 29592 295948 0 0 0 68 1530 3724 12 7 81 1
^C
[1052][meyer@inspiron17]$

Sorry for the formatting loss - not sure how to correct that. The free memory never dips like it does for the allocatable case, so it's clearly taking a ton more memory to use allocatable. I'm assuming these numbers are in kB. In the allocatable test, free memory jumps from a mere 40 MB to over 3 GB after killing the executable. The largest swap value I saw was only around 38 MB for allocatable, and only 8 MB for the static... I'm not sure how to interpret most of this. Where is the free memory going?

Even with two 1000^3 arrays, total memory usage should be around 2 GB.

Maybe the questions is not why the allocatable hogs so much, but why the static does not?

Ron_Green
Moderator
Static data is stored in the BSS segment, as you can see with the 'size' command:

rwgreen@spdr65:~/quad/rwgreen/forums/70229> ifort -o static -mcmodel=medium -shared-intel static.f90 -O0
rwgreen@spdr65:~/quad/rwgreen/forums/70229> size static
text data bss dec hex filename
2655 696 4000000032 4000003383 ee6b3537 static

rwgreen@spdr65:~/quad/rwgreen/forums/70229> ifort -o alloc -mcmodel=medium -shared-intel alloc.f90 -O0
rwgreen@spdr65:~/quad/rwgreen/forums/70229> size alloc
text data bss dec hex filename
3587 840 8 4435 1153 alloc

We generally discourage using static data. Many operating systems, including Windows versions, limit the static segment to 2GB. Dynamically allocated data will not have such restrictions.
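One practical note with arrays this large: an ALLOCATE can fail at runtime, and it is worth detecting that rather than letting the program abort. A minimal sketch (my addition, not from the thread) using the standard STAT= specifier:

```fortran
program alloc_check
  implicit none
  double precision, allocatable :: big(:,:,:)
  integer :: ierr

  ! With STAT= present, a failed allocation returns a nonzero code
  ! instead of terminating the program. 1000^3 doubles is 8 GB, so
  ! this is expected to fail on a machine with less memory than that.
  allocate(big(1000,1000,1000), stat=ierr)
  if (ierr /= 0) then
     print *, "allocation failed, stat =", ierr
     stop 1
  end if

  big = 0.0d0
  print *, "allocated ", size(big), " elements"
  deallocate(big)
end program alloc_check
```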

Now, given that your total data allocation is the same in both cases AT -O0, I cannot explain why one version would thrash your system and the other would not. Note that I said "at -O0". Optimization can play tricks, however. For example, in the static case:

program test
IMPLICIT NONE
REAL :: stupid(1000,1000,1000)
stupid = 0.0_4
print*, "shape is ", SHAPE(stupid)
end program test

(I had to switch to single precision since my system only has 6GB.) Also note the explicit kind on the constant: 0.0_4 is a kind-4 (single precision) literal.

Now, what is interesting is that at -O2, the compiler figures out that you never actually use the data in 'stupid', so it doesn't allocate it in BSS at all! Observe:

rwgreen@spdr65:~/quad/rwgreen/forums/70229> ifort -o static -mcmodel=medium -shared-intel static.f90 -O2
rwgreen@spdr65:~/quad/rwgreen/forums/70229> size static
text data bss dec hex filename
2419 696 8 3123 c33 static

Thus, this version of the code runs instantly because it never initializes the data in stupid. What I can't explain is why the allocatable version can't apply the same optimization:

program test
IMPLICIT NONE
integer i,j,k
REAL, allocatable :: stupid(:,:,:)
ALLOCATE(stupid(1000,1000,1000))
stupid = 0.0_4
print*, "shape is ", SHAPE(stupid)
end program test

If you compile this at -O2 (or leave off the -O option, which gives you the default of -O2), it still does the allocation and initialization and hence runs considerably more slowly. I would guess the compiler sees the ALLOCATE and figures you REALLY REALLY do want to allocate the data, and thus does not optimize it away the way it did in the static case.

The takeaway here: if you actually use the stupid array in the code, the two cases should run approximately equally. There will be some difference in runtime behaviour. The static data is 'allocated' at program load - the loader carves out a chunk of BSS for the array at load time. The allocatable case makes a 'malloc()' call to allocate the data on the heap (a different region of the process address space) after the process starts.

Other effects can also come into play from optimization. The compiler may choose a vectorized library call, intel_fast_memset(), to do the initialization, OR it may explicitly generate a vectorized loop, OR it may fall back to a serial nested loop the way you would write this in the old F77 style.
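A sketch of a fairer test (my own, not from the thread): if the program actually reads the data back, the compiler cannot elide the initialization at -O2, and the static and allocatable versions should behave comparably.

```fortran
program test_used
  implicit none
  real, allocatable :: stupid(:,:,:)

  allocate(stupid(1000,1000,1000))
  stupid = 0.0
  stupid(1,1,1) = 1.0

  ! Summing the array is a real use of the data, so the compiler
  ! must keep both the allocation and the initialization.
  print *, "shape is ", shape(stupid), " sum is ", sum(stupid)

  deallocate(stupid)
end program test_used
```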

Since you are a scientist trying to do real work, rather than a computer scientist interested in the inner workings of compiler optimization, let's step back and evaluate your goals. Are you trying to determine whether it's better to use static arrays or allocatable arrays? I think that is what you are trying to answer, yes? I've seen many a good physicist thrash around on simple tests like this that don't really mimic their code, and end up misled by obscure compiler optimizations causing anomalous behavior the real code would never exhibit.

I assume you have some real code that does real work. Forget this little testcase. There should be no performance disadvantage to using allocatable arrays versus static arrays in real code. There also will not be any performance advantage; it should be a wash. The advantage of allocatable, again, is that your array sizes are not constrained by the static segment. Static data, as I mentioned, is limited to 2GB by various Linux distros and by all Windows versions. For that reason alone you should choose dynamically allocated arrays.

ron
etmeyer
Beginner
Wow, very interesting. That explains the fact that my frustrated move of turning everything into static arrays did not help at all. They don't call it the curse of many dimensions for nothing... time to turn this project over to the supercomputer, I think.

One other question, if the answer is readily available:

I frequently come up with questions regarding optimization for which I am probably not educated enough to design good tests, and I'm sure the answers are out there. For example, I'm running a subroutine which repeatedly loads into memory a large array from a text file. This seems not ideal, since Fortran passes everything through pointers, yet it also causes re-usability issues to have a pointer passed to a subroutine when it is not needed externally. In any case, I have usually assumed that loading from a text file is a much slower process than keeping something in memory, but I don't really know. This is really basic, I realize, but also outside the scope of my self-education. Are there any references for this kind of thing?

In any case, thanks very much for the help.
Ron_Green
Moderator
Quoting - etmeyer

Your assumptions are correct: reading text files uses Fortran "formatted" I/O, which is considerably slower than reading the same data from an "unformatted" file. You select the file format with the FORM= specifier on the OPEN statement:

OPEN( unit=42, file="whatever", form="formatted", ... ) or some similar OPEN

form="unformatted" is typically 1-2 orders of magnitude faster; it reads and writes data in the native binary format of the computer you are on.
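A minimal round-trip sketch (my illustration; the unit number and filename are arbitrary) showing unformatted write and read:

```fortran
program io_demo
  implicit none
  real(8) :: a(1000), b(1000)

  call random_number(a)

  ! Write the whole array as one unformatted (native binary) record
  open(unit=42, file="data.bin", form="unformatted", status="replace")
  write(42) a
  close(42)

  ! Read it back; the record must be read with the same shape it was written
  open(unit=42, file="data.bin", form="unformatted", status="old")
  read(42) b
  close(42)

  if (all(a == b)) print *, "round trip OK"
end program io_demo
```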

So if form="unformatted" is 1-2 orders of magnitude faster, why would anyone use form="formatted"? Well, a text (formatted) file can be read on any computer and is portable. Scientists need to be able to share data files with colleagues, and using a text file guarantees that the file can be read no matter what computer someone has.

The unfortunate thing is that form="unformatted" is both system dependent and compiler dependent. The order in which the bytes of a 4- or 8-byte real are stored varies in a property called 'endianness': computers are either little-endian or big-endian, meaning the bytes are stored from the low end up (little-endian) or from the high end down (big-endian). PCs are little-endian, but older SGI and other systems were big-endian, so unformatted files could not be shared between the two without some conversion mechanism to swap byte orders. To see the extent of the problem, look up the CONVERT= specifier on the OPEN statement supported by Intel Fortran; you'll see we've had to set up conversion methods for a variety of formats. And to make matters worse, even on a platform like a PC, compiler vendors do not agree on how to lay out file and record markers within unformatted files. Thus, an unformatted file created by IFORT may not be readable by gfortran, g95, or PGI.
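As a sketch of the CONVERT= mechanism mentioned above (the filename here is hypothetical, and CONVERT= is an Intel Fortran extension, not standard Fortran):

```fortran
program read_foreign
  implicit none
  real(8) :: values(100)

  ! Byte-swap on the fly: read a big-endian unformatted file
  ! (e.g. written on an old SGI) on a little-endian PC.
  open(unit=43, file="sgi_data.bin", form="unformatted", &
       convert="big_endian", status="old")
  read(43) values
  close(43)

  print *, "first value: ", values(1)
end program read_foreign
```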

But there is hope. You can continue to use formatted text files and just accept the slowness as a tradeoff for portability and sharing. Or there are at least two efforts to create vendor- and platform-neutral data files, along with service routines to create and read those file formats. If you are interested, take a look at the HDF and NetCDF projects. These two file I/O layers are used extensively in the scientific community:

NetCDF: http://www.unidata.ucar.edu/software/netcdf/

HDF group: http://www.hdfgroup.org/HDF5/
TimP
Honored Contributor III
And to make matters worse, even on a platform like a PC, compiler vendors do not agree on how to set up file and record markers within unformatted files. Thus, an unformatted file created by IFORT may not be able to be read by gfortran, g95 or PGI.

ifort, PGI, and gfortran seem to have reached reasonable compatibility on Linux x86_64. Don't count on it for obsolete versions, however. gfortran for Windows is still trying to solve the problem of supporting >2GB files, but I wouldn't count on HDF5 working there either.
Ron_Green
Moderator
And as for memory versus data files: memory reads are around 3 orders of magnitude faster.

So why not put all the data hard-coded in the program using DATA statements or other explicit initialization?

Certainly for data that does not change over time, this is something to consider. For example, phase tables of water (steam tables) will never change (hopefully) - a good candidate for building into the program as a table rather than reading from disk.

On the other extreme: data that define a particular problem and will be changed frequently to define new problems. You don't want to have to edit files and recompile for each problem you run. These should be read in from disk.

There are data sets that fall in between these two - materials tables for steel, composites, ceramics. These are updated regularly as new materials are added by manufacturers and older materials are removed from the market. I've seen these hardcoded and also read in from material property data files. If the code models a particular widget and that widget's material set is fixed and unchanging, you could hardcode it. If the widgets get redesigned once a year and new materials come and go, use a data file.
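A sketch of the "hardcode fixed data" option as a module of named constants (the numbers below are illustrative placeholders, not vetted steam-table values):

```fortran
module water_props
  implicit none
  ! Reference data compiled into the executable: no file I/O needed
  ! at runtime, but any change requires a rebuild.
  real(8), parameter :: temp_k(3)    = [ 300.0d0, 350.0d0, 400.0d0 ]
  real(8), parameter :: dens_kgm3(3) = [ 996.5d0, 973.7d0, 937.5d0 ]
end module water_props
```

Problem-specific inputs, by contrast, would be read from a file at startup so each run can use different values without recompiling.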



