The latest post on any Intel forum explaining that there is a limit on the size of static arrays in C++ programs linked under Windows was in 2012. It explained that even with 64-bit Windows, the Portable Executable (PE) format is 32-bit, and this results in a 2GB limit that prevents the use of very large static arrays. Is there any progress here? Has Microsoft provided a 64-bit Portable Executable format, and if so, how can one compile a larger image? My Intel C++ compiler is 11.1.
You are still expected to use new or malloc and variants in order to overcome the 2GB static object limit on X64.
Maurice,
Following Tim's lead, you might consider something along the lines of:
#include "stdafx.h"

template<typename T>
struct Array {
  T* a;
  Array() { a = NULL; }
  Array(size_t n) { a = new T[n]; }
  ~Array() { if(a) delete [] a; }
  T& operator[](size_t i) { return a[i]; }
};

const int dim1=100;
const int dim2=200;
Array<int> A(dim1);
// Note: as declared, B holds dim2 rows of dim1 ints, while the loop below
// indexes it as B[i][j] with i<dim1 and j<dim2. The index ordering is
// corrected later in this thread.
Array<int[dim1]> B(dim2);

int _tmain(int argc, _TCHAR* argv[])
{
  for(int i=0; i<dim1; ++i) {
    A[i] = i;
    for(int j=0;j<dim2;++j) {
      B[i][j] = i*j;
    }
  }
  return 0;
}
Jim Dempsey
Notes on the suggestion above:
Part of the problem with allocatable arrays is that you have to allocate them. When creating many such arrays, the hard part is remembering to insert the allocations into the code somewhere (and remembering to delete them, if that is necessary too). This typically requires editing or maintaining multiple sections of code. Statically declared arrays get around this, but as you experienced, they are limited in size.
Using a template provides for the allocation to be automatic at program initialization time.
Code generation is efficient. The MS VC++ output is quite compact, and easier to understand than the Intel C++ output.
MS C++
for(int i=0; i<dim1; ++i) {
000000013FE11000  xor ecx,ecx
000000013FE11002  xor r10d,r10d
000000013FE11005  xor r11d,r11d
000000013FE11008  nop dword ptr [rax+rax]
A[i] = i;
000000013FE11010  mov rax,qword ptr [A (13FE135D8h)]
000000013FE11017  xor edx,edx
000000013FE11019  mov r8,r10
000000013FE1101C  mov dword ptr [r11+rax],ecx
000000013FE11020  mov r9d,0C8h
000000013FE11026  nop word ptr [rax+rax]
for(int j=0;j<dim2;++j) {
B[i][j] = i*j;
000000013FE11030  mov rax,qword ptr [B (13FE135D0h)]
000000013FE11037  add r8,4
000000013FE1103B  mov dword ptr [r8+rax-4],edx
000000013FE11040  add edx,ecx
000000013FE11042  dec r9
000000013FE11045  jne main+30h (13FE11030h)
int main(int argc, char* argv[])
{
for(int i=0; i<dim1; ++i) {
000000013FE11047  inc ecx
000000013FE11049  add r11,4
000000013FE1104D  add r10,190h
000000013FE11054  cmp ecx,64h
000000013FE11057  jl main+10h (13FE11010h)
}
}
The above is not optimized to use vector instructions.
The Intel code is below. Don't be put off by the amount of code; I will highlight the inner loop afterwards.
for(int j=0;j<dim2;++j) {
000000013F811017  mov eax,4
for(int i=0; i<dim1; ++i) {
000000013F81101C  xor r9d,r9d
B[i][j] = i*j;
000000013F81101F  movdqa xmm1,xmmword ptr [__xi_z+28h (13F8131C0h)]
000000013F811027  xor r8d,r8d
000000013F81102A  movaps xmmword ptr [rsp+20h],xmm6
for(int j=0;j<dim2;++j) {
000000013F81102F  movd xmm0,eax
000000013F811033  pshufd xmm2,xmm0,0
A[i] = i;
000000013F811038  mov rax,qword ptr [A (13F815180h)]
000000013F81103F  mov ebx,r9d
000000013F811042  mov dword ptr [rax+r9*4],ebx
for(int j=0;j<dim2;++j) {
000000013F811046  mov rax,r8
000000013F811049  add rax,qword ptr [B (13F815188h)]
000000013F811050  mov rcx,rax
000000013F811053  and rcx,0Fh
000000013F811057  mov ecx,ecx
000000013F811059  test ecx,ecx
000000013F81105B  je main+99h (13F811099h)
000000013F81105D  test cl,3
000000013F811060  jne main+17Dh (13F81117Dh)
000000013F811066  neg ecx
000000013F811068  xor r10d,r10d
000000013F81106B  add ecx,10h
000000013F81106E  xor edx,edx
000000013F811070  shr ecx,2
000000013F811073  mov rax,r8
B[i][j] = i*j;
000000013F811076  mov r11,qword ptr [B (13F815188h)]
for(int j=0;j<dim2;++j) {
000000013F81107D  inc r10d
B[i][j] = i*j;
000000013F811080  mov dword ptr [r11+rax],edx
for(int j=0;j<dim2;++j) {
000000013F811084  add edx,ebx
000000013F811086  add rax,4
000000013F81108A  cmp r10d,ecx
000000013F81108D  jb main+76h (13F811076h)
000000013F81108F  mov rax,r8
000000013F811092  add rax,qword ptr [B (13F815188h)]
000000013F811099  mov edx,ecx
000000013F81109B  lea r10d,[rcx+1]
000000013F81109F  neg edx
000000013F8110A1  lea r11d,[rcx+2]
000000013F8110A5  and edx,3
000000013F8110A8  movd xmm6,ebx
000000013F8110AC  neg edx
000000013F8110AE  movd xmm0,ecx
000000013F8110B2  movd xmm3,r10d
000000013F8110B7  lea r10d,[rcx+3]
000000013F8110BB  movd xmm5,r11d
000000013F8110C0  add edx,0C8h
000000013F8110C6  punpcklqdq xmm0,xmm3
000000013F8110CA  movd xmm4,r10d
000000013F8110CF  punpcklqdq xmm5,xmm4
000000013F8110D3  pshufd xmm4,xmm6,0
B[i][j] = i*j;
000000013F8110D8  movdqa xmm3,xmm4
for(int j=0;j<dim2;++j) {
000000013F8110DC  shufps xmm0,xmm5,88h
B[i][j] = i*j;
000000013F8110E0  psrlq xmm3,20h
for(int j=0;j<dim2;++j) {
000000013F8110E5  mov r10d,edx
B[i][j] = i*j;
000000013F8110E8  movaps xmm5,xmm0
000000013F8110EB  movdqa xmm6,xmm4
000000013F8110EF  psrlq xmm5,20h
000000013F8110F4  pmuludq xmm6,xmm0
000000013F8110F8  paddd xmm0,xmm2
000000013F8110FC  pmuludq xmm5,xmm3
000000013F811100  pand xmm6,xmm1
000000013F811104  psllq xmm5,20h
000000013F811109  por xmm6,xmm5
000000013F81110D  movdqa xmmword ptr [rax+rcx*4],xmm6
for(int j=0;j<dim2;++j) {
000000013F811112  add rcx,4
000000013F811116  cmp rcx,r10
000000013F811119  jb main+0E8h (13F8110E8h)
000000013F81111B  mov eax,ebx
000000013F81111D  imul eax,edx
000000013F811120  cmp edx,0C8h
000000013F811126  lea rcx,[r8+rdx*4]
000000013F81112A  jae main+147h (13F811147h)
B[i][j] = i*j;
000000013F81112C  mov r10,qword ptr [B (13F815188h)]
for(int j=0;j<dim2;++j) {
000000013F811133  inc edx
B[i][j] = i*j;
000000013F811135  mov dword ptr [r10+rcx],eax
for(int j=0;j<dim2;++j) {
000000013F811139  add eax,ebx
000000013F81113B  add rcx,4
000000013F81113F  cmp edx,0C8h
000000013F811145  jb main+12Ch (13F81112Ch)
for(int i=0; i<dim1; ++i) {
000000013F811147  inc r9
000000013F81114A  add r8,190h
000000013F811151  cmp r9,64h
000000013F811155  jb main+38h (13F811038h)
}
}
Now the inner loop (SSE)
B[i][j] = i*j;
000000013F8110E8  movaps xmm5,xmm0
000000013F8110EB  movdqa xmm6,xmm4
000000013F8110EF  psrlq xmm5,20h
000000013F8110F4  pmuludq xmm6,xmm0
000000013F8110F8  paddd xmm0,xmm2
000000013F8110FC  pmuludq xmm5,xmm3
000000013F811100  pand xmm6,xmm1
000000013F811104  psllq xmm5,20h
000000013F811109  por xmm6,xmm5
000000013F81110D  movdqa xmmword ptr [rax+rcx*4],xmm6
for(int j=0;j<dim2;++j) {
000000013F811112  add rcx,4
000000013F811116  cmp rcx,r10
000000013F811119  jb main+0E8h (13F8110E8h)
Note, the inner loop is performing four B[i][j] = i*j assignments per iteration.
And now AVX inner loop
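B[i][j] = i*j;
000000013FA110D2  vpmulld xmm3,xmm1,xmm2
000000013FA110D7  vpaddd xmm2,xmm2,xmm0
000000013FA110DB  vmovdqu xmmword ptr [rax+rcx*4],xmm3
for(int j=0;j<dim2;++j) {
000000013FA110E0  add rcx,4
000000013FA110E4  cmp rcx,r10
000000013FA110E7  jb main+0D2h (13FA110D2h)
Note, as listed, this inner loop still performs four B[i][j] = i*j assignments per iteration (the store is a 128-bit xmmword and rcx advances by 4). The gain over the SSE version is the much shorter instruction sequence, since vpmulld multiplies four 32-bit integers directly.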
For small arrays you would not use the template arrays (unless you have a huge number of them).
For large arrays, the template arrays (allocated arrays) experience no computational penalty over static arrays. IOW there is no excuse not to use them (other than for a "goofy" declaration syntax).
Jim Dempsey
Jim,
Unfortunately this does not seem to work with five dimensional arrays. It compiles OK, but crashes part way through when run.
#include <stdarg.h>
#include <iostream>
#include <cstdlib>
#include <ctime>
#include <string>
#include <tchar.h>
template<typename T>
struct Array {
  T* a;
  Array() { a = NULL; }
  Array(size_t n) { a = new T[n]; }
  ~Array() { if(a) delete [] a; }
  T& operator[](size_t i) { return a[i]; }
};
const int dim1=3;
const int dim2=266;
const int dim3=436;
const int dim4=274;
const int dim5=2;
Array<float> A(dim1);
Array<float[dim1]> B(dim2);
Array<float[dim1][dim2]> C(dim3);
Array<float[dim1][dim2][dim3]> D(dim4);
Array<float[dim1][dim2][dim3][dim4]> E(dim5);
int _tmain(int argc, _TCHAR* argv[])
{
  for(int i=0; i<dim1; ++i) {
    A[i] = 0;
    for(int j=0;j<dim2;++j) {
      B[i][j] = 0;
      for(int k=0;k<dim3;++k){
        C[i][j][k] = 0;
        for(int l=0;l<dim4;++l){
          D[i][j][k][l] = 0;
          for(int m=0;m<dim5;++m){
            E[i][j][k][l][m] = 0;
            std::cout<<i<<" "<<j<<" "<<k<<" "<<l<<" "<<m<<" "<<E[i][j][k][l][m]<<"\n";
          }
        }
      }
    }
  }
  return 0;
}
I have split the 5-dimensional arrays into 3 dimensional arrays, but there seems to be a size limit smaller than the limit for static arrays. It is true that the run-time penalty is small, but since the problem is basically one of array size, I am no further forward.
Tim Prince wrote:
You are still expected to use new or malloc and variants in order to overcome the 2GB static object limit on X64.
I absolutely agree. I allocate at most about 128 bytes on the stack, e.g. "char tmp[128];" for purposes like "s(n)printf()". Why can't you just use dynamic allocation instead of BSS or stack allocation?
BTW, when I started to learn C, and later C++, about 20 years ago, the book said "big static arrays are a big no-no". This was the DOS era, and there was a limit of 64 KiB.
Marián "VooDooMan" Meravý wrote:
Quote:
Tim Prince wrote: You are still expected to use new or malloc and variants in order to overcome the 2GB static object limit on X64.
I absolutely agree. I allocate at most about 128 bytes on the stack, e.g. "char tmp[128];" for purposes like "s(n)printf()". Why can't you just use dynamic allocation instead of BSS or stack allocation?
I agree with both of you. I use stack allocation either for simple test cases or for small, short programs.
intel@ruperttaylor.com wrote:
Jim,
Unfortunately this does not seem to work with five dimensional arrays. It compiles OK, but crashes part way through when run.
...
I'd expect that to crash if you compiled it as Win32 (32-bit).
Array(size_t n) { a = new T[n]; assert(a); }
Adding the assert will detect allocation errors (32-bit), but unfortunately will not inform you if you have exceeded the page file limit. That would occur later, when you "first touch" the memory through an E[i][j][k][l][m] access.
An additional issue that may be a gotcha
You are using for(int i=0;..., same with j,k,l,m.
Instead you should be using for(size_t i=0;... same with j,k,l,m
Otherwise (on 64-bit) the index computations would be performed using signed 32-bit values.
Jim Dempsey
Intel platforms usually work much faster with 32-bit signed indexing. As Jim pointed out, that will limit the size of arrays which can be indexed (8GB for 32-bit data types). I guess int64_t may work better than size_t as it surely will still permit indexing any physically possible array.
Rupert,
Sorry for the delay in the response. I was away at IDF 2014 in San Francisco last week. Your program has an error in the index positioning. Use the following:
#include <stdio.h>
#include <memory.h>
#include <iostream>
#include <cstdlib>
#include <ctime>
#include <string>
#include <tchar.h>

template<typename T>
struct Array {
  T* a;
  Array() { a = NULL; }
  Array(__int64 n) { a = new T[n]; }
  ~Array() { if(a) delete [] a; }
  T& operator[](__int64 i) { return a[i]; }
};

const __int64 dim1=3;
const __int64 dim2=266;
const __int64 dim3=436;
const __int64 dim4=274;
const __int64 dim5=2;

// The leftmost index is the one passed to the constructor.
Array<float> A(dim1);
Array<float[dim2]> B(dim1);
Array<float[dim2][dim3]> C(dim1);
Array<float[dim2][dim3][dim4]> D(dim1);
Array<float[dim2][dim3][dim4][dim5]> E(dim1);

int main(int argc, char* argv[])
{
  for(__int64 i=0; i<dim1; ++i) {
    A[i] = 0;
    for(__int64 j=0;j<dim2;++j) {
      B[i][j] = 0;
      for(__int64 k=0;k<dim3;++k){
        C[i][j][k] = 0;
        for(__int64 l=0;l<dim4;++l){
          D[i][j][k][l] = 0;
          for(__int64 m=0;m<dim5;++m){
            E[i][j][k][l][m] = 0;
            if(k+l+m==0)
              std::cout<<i<<" "<<j<<" "<<k<<" "<<l<<" "<<m<<" "<<E[i][j][k][l][m]<<"\n";
          }
        }
      }
    }
  }
  return 0;
}
The first index is the one enclosed in the (dim1) of the declaration. IOW
Array<float[dim_j][dim_k][dim_l][dim_m]> E(dim_i);
...
E[i][j][k][l][m] = 0;
Note, E[i] returns a float[dim_j][dim_k][dim_l][dim_m].
This then orders the (apparent) indexes i to m from left to right.
Jim Dempsey