Image size limit and Portable Executable

Maurice_T_ · ‎09-05-2014

The latest post on any Intel forum explaining that there is a limit on the size of static arrays in C++ programs linked under Windows was in 2012. It explained that even with 64-bit Windows, the Portable Executable format is 32-bit, and this results in a 2GB limit that prevents the use of very large static arrays. Is there any progress here: has Microsoft provided a 64-bit Portable Executable, and if so, how can one compile a larger image? My Intel C++ compiler is 11.1.

TimP · ‎09-06-2014

You are still expected to use new or malloc and variants in order to overcome the 2GB static object limit on X64.
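As a minimal sketch of that advice (not code from this thread), a static array can be replaced by a heap-backed container whose size is given at run time. The helper name make_big_array is hypothetical, and std::vector is used here in place of raw new or malloc.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: heap-backed replacement for `static float big[count];`.
// The storage comes from the heap at run time, so it does not count toward
// the size of the executable image.
std::vector<float> make_big_array(std::size_t count) {
    // Zero-initializes every element; throws std::bad_alloc if the
    // request cannot be satisfied.
    return std::vector<float>(count, 0.0f);
}
```

On a 64-bit build, make_big_array(600000000) requests about 2.4 GB, which is beyond the static limit discussed here; whether it succeeds depends on available memory and page file.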

jimdempseyatthecove · ‎09-06-2014

Maurice,

Following Tim's lead, you might consider something along the line of:

#include "stdafx.h"

template<typename T>
struct Array {
 T* a;
 Array() { a = NULL; }
 Array(size_t n) { a = new T[n]; }        // allocate n elements of type T
 ~Array() { if(a) delete [] a; }
 T& operator[](size_t i) { return a[i]; } // reference to element i
};

const int dim1=100;
const int dim2=200;

Array<int> A(dim1);
Array<int[dim2]> B(dim1);   // dim1 rows, each an int[dim2]

int _tmain(int argc, _TCHAR* argv[])
{
 for(int i=0; i<dim1; ++i) {
  A[i] = i;
  for(int j=0;j<dim2;++j) {
   B[i][j] = i*j;
  }
 }
 return 0;
}

Jim Dempsey

jimdempseyatthecove · ‎09-06-2014

Notes on the suggestion above:

Part of the problem with allocatable arrays is that you have to allocate them. When creating many of these arrays, the hard part is remembering to insert the allocations into the code somewhere (and remembering to delete them if that is necessary too). This typically requires editing or maintaining multiple sections of code. Statically declared arrays get around this, but as you experienced, they have limitations on the static size.

Using a template provides for the allocation to be automatic at program initialization time.

Code generation is efficient. The MS VC++ output is quite compact, and easier to understand than the Intel C++ output.

MS C++
 for(int i=0; i<dim1; ++i) {
000000013FE11000  xor         ecx,ecx  
000000013FE11002  xor         r10d,r10d  
000000013FE11005  xor         r11d,r11d  
000000013FE11008  nop         dword ptr [rax+rax]  
  A[i] = i;
000000013FE11010  mov         rax,qword ptr [A (13FE135D8h)]  
000000013FE11017  xor         edx,edx  
000000013FE11019  mov         r8,r10  
000000013FE1101C  mov         dword ptr [r11+rax],ecx  
000000013FE11020  mov         r9d,0C8h  
000000013FE11026  nop         word ptr [rax+rax]  
  for(int j=0;j<dim2;++j) {
   B[i][j] = i*j;
000000013FE11030  mov         rax,qword ptr [B (13FE135D0h)]  
000000013FE11037  add         r8,4  
000000013FE1103B  mov         dword ptr [r8+rax-4],edx  
000000013FE11040  add         edx,ecx  
000000013FE11042  dec         r9  
000000013FE11045  jne         main+30h (13FE11030h)  

int main(int argc, char* argv[])
{
 for(int i=0; i<dim1; ++i) {
000000013FE11047  inc         ecx  
000000013FE11049  add         r11,4  
000000013FE1104D  add         r10,190h  
000000013FE11054  cmp         ecx,64h  
000000013FE11057  jl          main+10h (13FE11010h)  
  }
 }

The above is not optimized to use vector instructions.

The Intel code is below. Don't be put off by the amount of code; I will highlight the inner loop afterwards.

  for(int j=0;j<dim2;++j) {
000000013F811017  mov         eax,4  
 for(int i=0; i<dim1; ++i) {
000000013F81101C  xor         r9d,r9d  
   B[i][j] = i*j;
000000013F81101F  movdqa      xmm1,xmmword ptr [__xi_z+28h (13F8131C0h)]  
000000013F811027  xor         r8d,r8d  
000000013F81102A  movaps      xmmword ptr [rsp+20h],xmm6  
  for(int j=0;j<dim2;++j) {
000000013F81102F  movd        xmm0,eax  
000000013F811033  pshufd      xmm2,xmm0,0  
  A[i] = i;
000000013F811038  mov         rax,qword ptr [A (13F815180h)]  
000000013F81103F  mov         ebx,r9d  
000000013F811042  mov         dword ptr [rax+r9*4],ebx  
  for(int j=0;j<dim2;++j) {
000000013F811046  mov         rax,r8  
000000013F811049  add         rax,qword ptr [B (13F815188h)]  
000000013F811050  mov         rcx,rax  
000000013F811053  and         rcx,0Fh  
000000013F811057  mov         ecx,ecx  
000000013F811059  test        ecx,ecx  
000000013F81105B  je          main+99h (13F811099h)  
000000013F81105D  test        cl,3  
000000013F811060  jne         main+17Dh (13F81117Dh)  
000000013F811066  neg         ecx  
000000013F811068  xor         r10d,r10d  
000000013F81106B  add         ecx,10h  
000000013F81106E  xor         edx,edx  
000000013F811070  shr         ecx,2  
000000013F811073  mov         rax,r8  
   B[i][j] = i*j;
000000013F811076  mov         r11,qword ptr [B (13F815188h)]  
  for(int j=0;j<dim2;++j) {
000000013F81107D  inc         r10d  
   B[i][j] = i*j;
000000013F811080  mov         dword ptr [r11+rax],edx  
  for(int j=0;j<dim2;++j) {
000000013F811084  add         edx,ebx  
000000013F811086  add         rax,4  
000000013F81108A  cmp         r10d,ecx  
000000013F81108D  jb          main+76h (13F811076h)  
000000013F81108F  mov         rax,r8  
000000013F811092  add         rax,qword ptr [B (13F815188h)]  
000000013F811099  mov         edx,ecx  
000000013F81109B  lea         r10d,[rcx+1]  
000000013F81109F  neg         edx  
000000013F8110A1  lea         r11d,[rcx+2]  
000000013F8110A5  and         edx,3  
000000013F8110A8  movd        xmm6,ebx  
000000013F8110AC  neg         edx  
000000013F8110AE  movd        xmm0,ecx  
000000013F8110B2  movd        xmm3,r10d  
000000013F8110B7  lea         r10d,[rcx+3]  
000000013F8110BB  movd        xmm5,r11d  
000000013F8110C0  add         edx,0C8h  
000000013F8110C6  punpcklqdq  xmm0,xmm3  
000000013F8110CA  movd        xmm4,r10d  
000000013F8110CF  punpcklqdq  xmm5,xmm4  
000000013F8110D3  pshufd      xmm4,xmm6,0  
   B[i][j] = i*j;
000000013F8110D8  movdqa      xmm3,xmm4  
  for(int j=0;j<dim2;++j) {
000000013F8110DC  shufps      xmm0,xmm5,88h  
   B[i][j] = i*j;
000000013F8110E0  psrlq       xmm3,20h  
  for(int j=0;j<dim2;++j) {
000000013F8110E5  mov         r10d,edx  
   B[i][j] = i*j;
000000013F8110E8  movaps      xmm5,xmm0  
000000013F8110EB  movdqa      xmm6,xmm4  
000000013F8110EF  psrlq       xmm5,20h  
000000013F8110F4  pmuludq     xmm6,xmm0  
000000013F8110F8  paddd       xmm0,xmm2  
000000013F8110FC  pmuludq     xmm5,xmm3  
000000013F811100  pand        xmm6,xmm1  
000000013F811104  psllq       xmm5,20h  
000000013F811109  por         xmm6,xmm5  
000000013F81110D  movdqa      xmmword ptr [rax+rcx*4],xmm6  
  for(int j=0;j<dim2;++j) {
000000013F811112  add         rcx,4  
000000013F811116  cmp         rcx,r10  
000000013F811119  jb          main+0E8h (13F8110E8h)  
000000013F81111B  mov         eax,ebx  
000000013F81111D  imul        eax,edx  
000000013F811120  cmp         edx,0C8h  
000000013F811126  lea         rcx,[r8+rdx*4]  
000000013F81112A  jae         main+147h (13F811147h)  
   B[i][j] = i*j;
000000013F81112C  mov         r10,qword ptr [B (13F815188h)]  
  for(int j=0;j<dim2;++j) {
000000013F811133  inc         edx  
   B[i][j] = i*j;
000000013F811135  mov         dword ptr [r10+rcx],eax  
  for(int j=0;j<dim2;++j) {
000000013F811139  add         eax,ebx  
000000013F81113B  add         rcx,4  
000000013F81113F  cmp         edx,0C8h  
000000013F811145  jb          main+12Ch (13F81112Ch)  
 for(int i=0; i<dim1; ++i) {
000000013F811147  inc         r9  
000000013F81114A  add         r8,190h  
000000013F811151  cmp         r9,64h  
000000013F811155  jb          main+38h (13F811038h)  
  }
 }

Now the inner loop (SSE)

   B[i][j] = i*j;
000000013F8110E8  movaps      xmm5,xmm0  
000000013F8110EB  movdqa      xmm6,xmm4  
000000013F8110EF  psrlq       xmm5,20h  
000000013F8110F4  pmuludq     xmm6,xmm0  
000000013F8110F8  paddd       xmm0,xmm2  
000000013F8110FC  pmuludq     xmm5,xmm3  
000000013F811100  pand        xmm6,xmm1  
000000013F811104  psllq       xmm5,20h  
000000013F811109  por         xmm6,xmm5  
000000013F81110D  movdqa      xmmword ptr [rax+rcx*4],xmm6  
  for(int j=0;j<dim2;++j) {
000000013F811112  add         rcx,4  
000000013F811116  cmp         rcx,r10  
000000013F811119  jb          main+0E8h (13F8110E8h)

Note, the inner loop is performing four B[i][j] = i*j initializations per iteration.

And now AVX inner loop

   B[i][j] = i*j;
000000013FA110D2  vpmulld     xmm3,xmm1,xmm2  
000000013FA110D7  vpaddd      xmm2,xmm2,xmm0  
000000013FA110DB  vmovdqu     xmmword ptr [rax+rcx*4],xmm3  
  for(int j=0;j<dim2;++j) {
000000013FA110E0  add         rcx,4  
000000013FA110E4  cmp         rcx,r10  
000000013FA110E7  jb          main+0D2h (13FA110D2h)

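Note, as listed this inner loop still uses 128-bit xmm registers and advances rcx by 4, so it performs four B[i][j] = i*j initializations per iteration, with far fewer instructions than the SSE version.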

For small arrays you would not use the template arrays (unless you have a huge amount of them).

For large arrays, the template arrays (allocated arrays) experience no computational penalty over static arrays. IOW there is no excuse not to use them (other than for a "goofy" declaration syntax).

Jim Dempsey

Rupert_T_ · ‎09-08-2014

Jim,

Unfortunately this does not seem to work with five dimensional arrays. It compiles OK, but crashes part way through when run.

#include <stdarg.h>
#include <iostream>
#include <cstdlib>
#include <ctime>
#include <string>
#include <tchar.h>

template<typename T>
struct Array {
   T* a;
   Array() { a = NULL; }
   Array(size_t n) { a = new T[n]; }
   ~Array() { if(a) delete [] a; }
   T& operator[](size_t i) { return a[i]; }
};

const int dim1=3;
const int dim2=266;
const int dim3=436;
const int dim4=274;
const int dim5=2;

Array<float> A(dim1);
Array<float[dim1]> B(dim2);
Array<float[dim1][dim2]> C(dim3);
Array<float[dim1][dim2][dim3]> D(dim4);
Array<float[dim1][dim2][dim3][dim4]> E(dim5);

int _tmain(int argc, _TCHAR* argv[])
{
   for(int i=0; i<dim1; ++i) {
      A[i] = 0;
      for(int j=0;j<dim2;++j) {
         B[i][j] = 0;
         for(int k=0;k<dim3;++k){
            C[i][j][k] = 0;
            for(int l=0;l<dim4;++l){
               D[i][j][k][l] = 0;
               for(int m=0;m<dim5;++m){
                  E[i][j][k][l][m] = 0;
                  std::cout<<i<<" "<<j<<" "<<k<<" "<<l<<" "<<m<<" "<<E[i][j][k][l][m]<<"\n";
               }
            }
         }
      }
   }
   return 0;
}

Rupert_T_ · ‎09-09-2014

I have split the 5-dimensional arrays into 3 dimensional arrays, but there seems to be a size limit smaller than the limit for static arrays. It is true that the run-time penalty is small, but since the problem is basically one of array size, I am no further forward.
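To judge whether a given set of dimensions should fit, it can help to work out the byte count explicitly. The following is a small sketch (the function name array_bytes is made up for illustration) that multiplies the extents in 64-bit arithmetic, applied to the dimensions posted above.

```cpp
#include <cstdint>

// Bytes needed for a dense float array with the given extents.
// The product is accumulated in 64 bits so it cannot wrap at 4 GB.
std::uint64_t array_bytes(const std::uint64_t* extents, int rank) {
    std::uint64_t bytes = sizeof(float);
    for (int d = 0; d < rank; ++d) {
        bytes *= extents[d];
    }
    return bytes;
}

// Example: extents {3, 266, 436, 274, 2} give 190,664,544 floats,
// i.e. 762,658,176 bytes when sizeof(float) is 4.
```

By this arithmetic the 5-dimensional array alone is roughly 727 MiB, which is under 2GB but is a large single contiguous block for a 32-bit process to find.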

Marián__VooDooMan__M · ‎09-09-2014

Tim Prince wrote:

You are still expected to use new or malloc and variants in order to overcome the 2GB static object limit on X64.

I absolutely agree. On the stack I allocate at most about 128 bytes, e.g. "char tmp[128];" for things like "s(n)printf()". Why can't you just use dynamic allocation instead of BSS or stack allocation?

Marián__VooDooMan__M · ‎09-09-2014

BTW, when I started to learn C, and later C++, about 20 years ago, the book said "big static arrays are a big no-no". This was the DOS era, and there was a limit of 64 KiB.

Bernard · ‎09-10-2014

Marián "VooDooMan" Meravý wrote:

Quote:
Tim Prince wrote:
You are still expected to use new or malloc and variants in order to overcome the 2GB static object limit on X64.

I absolutely agree. On the stack I allocate at most about 128 bytes, e.g. "char tmp[128];" for things like "s(n)printf()". Why can't you just use dynamic allocation instead of BSS or stack allocation?

I agree with both of you. I use stack allocation either for simple test cases or for small short programs.

jimdempseyatthecove · ‎09-12-2014

intel@ruperttaylor.com wrote:

Jim,

Unfortunately this does not seem to work with five dimensional arrays. It compiles OK, but crashes part way through when run.

...

I'd expect that to crash if you compiled that as Win32 (32-bit).

Array(size_t n) {a = new T[n]; assert(a); }

Adding the assert will detect allocation errors (32-bit), but unfortunately will not inform you if you have exceeded the Page File Limit. That failure would occur later, when you "first touch" the memory with the E[i][j][k][l][m] = 0; assignment.
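One caveat on the assert: in standard C++ a plain new expression reports failure by throwing std::bad_alloc rather than returning a null pointer, so assert(a) may never get the chance to fire (older compilers or non-default settings may behave differently). A minimal sketch of a variant that does return null on failure, using std::nothrow, is below; the name CheckedArray and the ok() member are hypothetical additions, not part of the template posted in this thread.

```cpp
#include <cstddef>
#include <new>

// Hypothetical variant of the thread's Array template.
template <typename T>
struct CheckedArray {
    T* a;
    // new (std::nothrow) yields a null pointer instead of throwing
    // when the allocation cannot be satisfied.
    explicit CheckedArray(std::size_t n) : a(new (std::nothrow) T[n]) {}
    ~CheckedArray() { delete[] a; }   // delete[] on null is a no-op
    bool ok() const { return a != NULL; }
    T& operator[](std::size_t i) { return a[i]; }
private:
    // Copying would lead to a double delete, so it is disabled.
    CheckedArray(const CheckedArray&);
    CheckedArray& operator=(const CheckedArray&);
};
```

A caller would check ok() once after construction and report the failure before entering the loops. This still does not detect page-file exhaustion at first touch.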

An additional issue that may be a gotcha

You are using for(int i=0;..., same with j,k,l,m.

Instead you should be using for(size_t i=0;... same with j,k,l,m

Otherwise (on 64-bit) the index computations would be performed using signed 32-bit values.
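To illustrate why the index width matters, here is a small sketch comparing a row-major offset computed in 32 bits with the same offset computed in 64 bits. The function names are invented for this example, and unsigned types are used for the 32-bit case so that the wrap-around is well defined (overflow of a signed int is undefined behaviour).

```cpp
#include <cstdint>

// Row-major offset computed in 32 bits; the result wraps modulo 2^32
// once i * cols + j no longer fits.
std::uint32_t offset32(std::uint32_t i, std::uint32_t j, std::uint32_t cols) {
    return i * cols + j;
}

// The same offset computed in 64 bits.
std::uint64_t offset64(std::uint64_t i, std::uint64_t j, std::uint64_t cols) {
    return i * cols + j;
}
```

For an array with 100000 columns, element (100000, 5) has the true offset 10000000005, which offset64 returns, while offset32 wraps and returns a much smaller value that points at the wrong element.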

Jim Dempsey

TimP · ‎09-12-2014

Intel platforms usually work much faster with 32-bit signed indexing. As Jim pointed out, that will limit the size of arrays which can be indexed (8GB for 32-bit data types). I guess int64_t may work better than size_t as it surely will still permit indexing any physically possible array.

jimdempseyatthecove · ‎09-15-2014

Rupert,

Sorry for the delay in the response. I was away at IDF 2014 in San Francisco last week. Your program has an error in the index positioning. Use the following:

#include <stdio.h>
#include <memory.h>
#include <iostream>
#include <cstdlib>
#include <ctime>
#include <string>
#include <tchar.h>

template<typename T>
struct Array {
 T* a;
 Array() { a = NULL; }
 Array(__int64 n) { a = new T[n]; }        // allocate n elements of type T
 ~Array() { if(a) delete [] a; }
 T& operator[](__int64 i) { return a[i]; } // reference to element i
};
const __int64 dim1=3;
const __int64 dim2=266;
const __int64 dim3=436;
const __int64 dim4=274;
const __int64 dim5=2;
// The constructor argument is the extent of the first (leftmost) index.
Array<float> A(dim1);
Array<float[dim2]> B(dim1);
Array<float[dim2][dim3]> C(dim1);
Array<float[dim2][dim3][dim4]> D(dim1);
Array<float[dim2][dim3][dim4][dim5]> E(dim1);
int main(int argc, char* argv[])
{
  for(__int64 i=0; i<dim1; ++i) {
    A[i] = 0;
    for(__int64 j=0;j<dim2;++j) {
      B[i][j] = 0;
      for(__int64 k=0;k<dim3;++k){
        C[i][j][k] = 0;
        for(__int64 l=0;l<dim4;++l){
          D[i][j][k][l] = 0;
          for(__int64 m=0;m<dim5;++m){
            E[i][j][k][l][m] = 0;
            if(k+l+m==0)   // print only once per (i,j) to keep output short
              std::cout<<i<<" "<<j<<" "<<k<<" "<<l<<" "<<m<<" "<<E[i][j][k][l][m]<<"\n";
          }
        }
      }
    }
  }
  return 0;
}

The first index is the one enclosed in the (dim1) of the declaration. IOW

Array <float[dim_j][dim_k][dim_l][dim_m]> E(dim_i);
...
E[i][j][k][l][m] = 0;

Note, E[i] returns a float[dim_j][dim_k][dim_l][dim_m].

This then orders the (apparent) indexes i to m from left to right.
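The index ordering can be checked on a small case. The sketch below repeats the template in a reduced form with small illustrative extents (DI to DM are made-up names and values, not the dimensions from this thread) and shows that one E[i] slice has the size of a float[DJ][DK][DL][DM], and that the last valid element is reached with the indexes written left to right.

```cpp
#include <cstddef>

// Reduced form of the thread's template, for illustration only.
template <typename T>
struct Array {
    T* a;
    explicit Array(std::size_t n) : a(new T[n]) {}
    ~Array() { delete[] a; }
    T& operator[](std::size_t i) { return a[i]; }
};

// Small illustrative extents; DI is the first (leftmost) index.
const std::size_t DI = 6, DJ = 2, DK = 3, DL = 4, DM = 5;

// Size in bytes of one E[i] slice, i.e. of a float[DJ][DK][DL][DM].
std::size_t slice_bytes() {
    Array<float[DJ][DK][DL][DM]> E(DI);
    return sizeof(E[0]);
}

// Writes to the last valid element and reads it back.
float touch_last() {
    Array<float[DJ][DK][DL][DM]> E(DI);
    E[DI - 1][DJ - 1][DK - 1][DL - 1][DM - 1] = 42.0f;
    return E[DI - 1][DJ - 1][DK - 1][DL - 1][DM - 1];
}
```

With these extents slice_bytes() equals 2*3*4*5*sizeof(float). Swapping the constructor argument with one of the bracketed extents would put the last element outside the allocation unless the two extents happen to be equal.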

Jim Dempsey