Solved: Maximum Vector Size in BRNG

Greg_Miller · ‎07-14-2011

VSL BRNGs appear to overflow the stack when the number of elements hits some limit when using the VisualStudio 2010 debugger. When calling the executable from the command line, this limit is somewhat higher, and when I run the release version of the code I can get to a higher limit. When switching from float to double, the limit is approximately in half.

So, before uploading an example of my code, let me ask: is there any documentation that provides an upper bound on the number of elements N in, for example,
vdRngUniform(VSL_RNG_METHOD_UNIFORM_STD,stream,N,xvec,0.0,1)?

For example, when running with the debugger, N=84954is fine, but N=84955 causes an unhandled exception.

Vamsi_S_Intel · ‎07-15-2011

Hi Greg,

As Sergey pointed out in his post, this issue is related to the program stack memory. In your program, the vector xvec[] isa local varaible and is allocated on thestack. When you give a large N value, it runs out of stack memory and causes segmentation fault. Your program runs fine as long as the vector sizeis within the stack size (ulimit -a shows thatthe default stack sizeis 10MB on my linux machine, I think this varies on a Windows machine). If you increase N by an order of magnitude,i.e, N=5000000 and with float data type, the size of the vector does not fit on the stack (5000000*4~=19MB). This also explains why it fails for half the number of elements when using double data type, as double occupies twice the memory as float.

You can modify your program so that the vector uses heap memory by using dynamic memory allocation ie., instead of
float xvec;
use
float *xvec = (float *) malloc (sizeof(float)*N);

View solution in original post

mecej4 · ‎07-14-2011

If, indeed, stack overflow is occurring, the upper bound that you seek would depend on how much stack is left unused at the time of call. That, in turn, would be dependent on the chain of calls on the stack; thus, this is not something that can be documented.

Greg_Miller · ‎07-14-2011

Thanks for prefacing your comment with "if, ineed, stack overflow is occurring, ...". The debugger error message may not be totally sufficient for finding the cause of a hang in either the debug or release version of the code run from the command line.

That said, here is a typical error I get when N is too large:
Unhandled exception at 0x5d3dcb17 (msvcr100d.dll) in MKL00b_84955.exe: 0xC00000FD: Stack overflow.

mecej4 · ‎07-14-2011

If the program is otherwise working fine and experiences stack overflow only once in a while, you may raise the permitted stack size by using the /stack:nnnnnn (where nnnnnn is the stack size in bytes) linker option.

If you have large local arrays, you may ask the compiler to allocate some on the heap instead of the stack, using the /heap-arrays[:size] option.

Sergey_M_Intel2 · ‎07-14-2011

Hi Greg,

It's a bit surprising that you face the stack overflow issue with VSL. Neither VSL function intensively uses stack, and I hardly imagine situation when VSL RNG is the source of the stack overflow problem.

While Idon't haveall detail to pin-point where exactly the source ofyour problem,my recommendation (as the first step) is to review the code to ensure that you don't doarray allocations (or other memory consuming operations) on the stack. For example, are you confident that xvec[] is allocated not on the stack?

Please let me know if this is not the case, and we will try to check other options,

Sergey

Greg_Miller · ‎07-15-2011

Hi Sergey,

As a rusty programmer, I don't know how to track stack and heap activity. However, below is a sample program that generates the error. I am running on a i7-980X with 24 GB of RAM under W7-64. The code is compiled as a WIN32 console app using the Intel compiler in VS2010. If you compile this, you may find that playing with the value of N will be needed to crash the app.

Thanks,

Greg

Here's the code sample.

#include

#include "mkl_vsl.h"

#define SEED 777

#define BRNG VSL_BRNG_SFMT19937

#define METHOD VSL_RNG_METHOD_UNIFORM_STD

#define N 500000

int main()

{

float xvec;

VSLStreamStatePtr stream;

vslNewStream( &stream, BRNG, SEED );

vsRngUniform( METHOD, stream, N, xvec, 0.0, 1.0 );

vslDeleteStream(&stream);

printf("N = %u, xvec[0] = %f, xvec[N-1] = %f\n",N,xvec[0],xvec[N-1]);

return 0;

}

Vamsi_S_Intel · ‎07-15-2011

Hi Greg,

As Sergey pointed out in his post, this issue is related to the program stack memory. In your program, the vector xvec[] isa local varaible and is allocated on thestack. When you give a large N value, it runs out of stack memory and causes segmentation fault. Your program runs fine as long as the vector sizeis within the stack size (ulimit -a shows thatthe default stack sizeis 10MB on my linux machine, I think this varies on a Windows machine). If you increase N by an order of magnitude,i.e, N=5000000 and with float data type, the size of the vector does not fit on the stack (5000000*4~=19MB). This also explains why it fails for half the number of elements when using double data type, as double occupies twice the memory as float.

You can modify your program so that the vector uses heap memory by using dynamic memory allocation ie., instead of
float xvec;
use
float *xvec = (float *) malloc (sizeof(float)*N);

Greg_Miller · ‎07-15-2011

Vamsi and Sergey, thank you both. Your answers explained the underlying issue and a workaround.

The explanation also leads to another question: Is this specific to the way the random number generator routines are implemented in the VSL, or will I need to consistently use dynamic memory allocation to avoid this problem for other functions?

I employed mecej's suggestion to increase the stack size and that worked, too.

-Greg

Andrey_N_Intel · ‎07-17-2011

Hi Greg,

Can you please clarify the reason for which you need to generate 0.5M random numbers at once?
As an alternative option you might want to try a different approach which would allow you to avoidissues withstack/dynamic memory allocation and probably improve perfromance of your application. It could be done in the following way.

#define N 2048 // size of the buffer which youneed to adjust for your application
#define M 500000 // total amount of random numbers

int main()
{
...
int i;
float xvec;
int nblocks, ntail;

vslNewStream( &stream, BRNG, SEED );
nblocks = M/N;
ntail = M - N * nblocks;

for ( i = 0; i < nblocks; i++ )
{
vsRngUniform( METHOD, stream, N, xvec, 0.0, 1.0 );

// process numbers available in buffer xvec
...
}

if ( ntail )
{

vsRngUniform( METHOD, stream, ntail, xvec, 0.0, 1.0 );

// process numbers available in buffer xvec
...

}
vslDeleteStream( &stream );

...

}

Similar topic was discussed earlier on the forum, in context of parallel random number generation. Please, have a look at http://software.intel.com/en-us/forums/showthread.php?t=83462&o=a&s=lr
Also for reference, please have a look at RNG performance data at http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm

Best,
Andrey

Greg_Miller · ‎07-18-2011

Hi Andrey,

The motivation for generating a large quantity of random numbers was to benchmark the performance of my system under various compiler settings and to check against the RNG performance data. I also wanted to observe CPU loading with Task Manager, and to do that it would be good for the code to take several 10s of seconds.

As my system is pretty fast, I found it difficult to measure computation time with regular timer code. (For some reason, the link to the Etimer.h code goes to a huge IT management API, which I downloaded in the hope of finding Etimer.h to no avail - if you can point me to it, I'd be grateful).

When I began to see the Stack Overflow message, I used smaller arrays and put theVSL line in a loop, similar to what you showed above. However, that was ineffective. The system would hang at the same quantity of random numbers regardless of how I got there.

With the popularity of MonteCarlo methods, I am a bit surprised that this problem occurs. One of the projects on my list is related to optical ray tracing, and it's very easy to get to where millions of random starting rays are required.

I will visit the forum you mentioned.

Thanks,

Greg

Andrey_N_Intel · ‎07-18-2011

Hi Greg,

Thanks for the clarifications. To understandperformance aspects of Intel MKL RNGs and their impact on your applicationI woulddevelop the benchmark by extending the code described above in my previous post in the following way:

__int64 t1, t2;

t1 = __rdtsc();
for ( k = 0; k {
for ( i = 0; i < nblocks; i++ )

{
vsRngUniform( METHOD, stream, N, xvec, 0.0, 1.0 );
// process numbers available in buffer xvec
...
}
...
}
t2 = __rdtsc();
...

By choosing number of iterations ITER_NUM you would
1.balance the load of your system
2. generate any amount of random numbers you need andwill not depend on stack/etc.Macro M which defines the total amount of random numbers is the additional option tocontrol the workload.
3. use caches more effectively.Presumably,thisis notthe casewhen you generate 0.5M numbers in one call of MKL RNG and then process them in the application. Also, to speed-up the computations it would make sense to use MKL RNGs in parallel mode by applying one of the parallelization methodologies (skip-ahead, leap-frog, family of BRNGs). And from this perspective,use of huge bufferswould also be ineffective.
4. have more clearpicture on contribution of random number generationintooverall performance of your code.

In the real simulations, of course, you wouldnot need to use "iterations" and would chooseamount of random numbers whichprovide required statistical accuracy of the simulations, millions or billions.With "blocking" approachyou will be able to do this and will not depend on stack size, etc.

I also wounder if rdtsc function which returns clocks and whose use is demonstrated above would work for your purposes?

Please, letus know if this clarifies thingsor you havemore questions.

Best regards,
Andrey

Sergey_M_Intel2 · ‎07-18-2011

Hi Greg,

Thank you! Your code example helps to root cause the issue. Indeed based on your code example I see that xvec is allocated on stack as it is local variable. Simplest thing youcan do is to add static attribute to xvec[]:

int main()
{

static float xvec;

...

In that case the allocation will be done in static memory.

This is not RNG specific. You can reproduce the stack overflow by playing with any other function call. As mecej4 indicated, the maximum stack size is fixed to certain amount. When you are close to the limit (this was your case since large amount of stack was used by xvec) then any function call may exceed the limit and lead to raising the Stack Overflow exception.

The brute force approach is to increase the stack limit. This may work in your case but there is no guarantee that when you call some function demanding for stack you may hit the limit again. General recommendation is to avoid allocation of large arrays on the stack by just adding static attribute to the array variable definition.

Also Andrey's suggestion to do blocking of the data will be useful. Optimal application performance is achieved when you try to keep the data in cache. Typical method for keeping data in cache is to use data blocking, e.g. by following Andrey's suggestion. And again, this generaloptimization hintrather thanRNG specific one.

I hope that helps.

Best regards,
Sergey