Poor performance of multiple Monte Carlo processes with large numpy.array manipulation

Gang_Z_ · ‎12-28-2017

I found that:

1. A single Monte Carlo run that works with a large numpy.array (about 500000 elements) is slow.

More surprisingly,

2. Running MULTIPLE Monte Carlo processes, each doing the same work as in 1, costs far more time than a single run: about 10 times as much.

Does memory allocation make the processes significantly slower?

Does anyone have an idea how to improve performance in this situation?

Thanks in advance. :)

Oleksandr_P_Intel · ‎01-08-2018

Any snippet of code to work with? Relation of the size of array to the amount of memory you have?

There is too little information to offer any useful suggestion.

Gang_Z_ · ‎07-19-2018

Oleksandr P. (Intel) wrote:

Any snippet of code to work with? Relation of the size of array to the amount of memory you have?

There is too little information to offer any useful suggestion.

We first generate a random numpy ndarray of 500000 float elements drawn from a normal distribution. This array is the scenario-analysis input to the Monte Carlo method, which performs many matrix operations and produces large intermediate ndarrays. Finally, it outputs the final result.

This is repeated hundreds of times, either in a loop or with multiprocessing map_async.

The total time cost is incredibly large. Without the loop, a single Monte Carlo run is very quick. The time cost of the repeated version is not hundreds of times T(1), but far more.
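
To make the comparison against "hundreds of times T(1)" concrete, the sequential case could be timed along these lines. This is only a sketch: `one_run`, `time_runs`, and the `exp`-based calculation are hypothetical stand-ins for the real matrix operations, which are not shown in this thread.

```python
import time

import numpy as np


def one_run(rng, n=500_000):
    """One Monte Carlo run on n normally distributed floats."""
    x = rng.standard_normal(n)
    # Hypothetical stand-in for the real matrix operations.
    y = np.exp(0.1 * x) - 1.0
    return float(y.mean())


def time_runs(n_runs, n=500_000, seed=0):
    """Time n_runs sequential runs; return (elapsed seconds, results)."""
    rng = np.random.default_rng(seed)
    start = time.perf_counter()
    results = [one_run(rng, n) for _ in range(n_runs)]
    elapsed = time.perf_counter() - start
    return elapsed, results
```

Comparing `100 * time_runs(1)[0]` with `time_runs(100)[0]` would show whether the plain loop already scales worse than linearly, or whether the extra cost only appears once multiprocessing is involved.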

I found the following explanation:

Parallelism in Python is intrinsically costly because of the Global Interpreter Lock (GIL), which prevents multiple native threads from executing Python bytecode simultaneously. multiprocessing works around this limitation by spawning a separate Python interpreter for every worker process, and using pickling to send arguments and return values to and from the workers. Unfortunately this entails a lot of unavoidable overhead.
If you absolutely must use multiprocessing, it's advisable to do as much work as possible within each process in order to minimize the relative amount of time spent spawning and killing processes. For example, if you're processing chunks of a larger array in parallel, then make the chunks as large as possible, and do as many processing steps as you can in one go, rather than looping over your array multiple times.
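
A minimal sketch of that advice, assuming the runs are independent: each worker receives one small task (a seed and a batch size), generates its own arrays, and returns only a short list of floats, so no large ndarray is ever pickled. The names `split_runs`, `simulate_batch`, and `run_parallel`, and the placeholder payoff, are hypothetical and not taken from the code discussed in this thread.

```python
import multiprocessing as mp

import numpy as np

N_SAMPLES = 500_000  # array size mentioned in this thread


def split_runs(n_runs, n_workers):
    """Split n_runs into at most n_workers near-equal batch sizes."""
    base, extra = divmod(n_runs, n_workers)
    sizes = [base + (1 if i < extra else 0) for i in range(n_workers)]
    return [s for s in sizes if s > 0]


def simulate_batch(task):
    """Run a whole batch of Monte Carlo runs inside one worker.

    Only the (seed, batch_size) tuple is pickled in, and only a short
    list of floats is pickled out.
    """
    seed, batch_size = task
    rng = np.random.default_rng(seed)
    results = []
    for _ in range(batch_size):
        x = rng.standard_normal(N_SAMPLES)
        # Placeholder payoff standing in for the real matrix operations.
        results.append(float(np.mean(np.maximum(x, 0.0))))
    return results


def run_parallel(n_runs, n_workers=4, base_seed=0):
    """Dispatch one large batch per worker instead of one task per run."""
    sizes = split_runs(n_runs, n_workers)
    tasks = [(base_seed + i, size) for i, size in enumerate(sizes)]
    with mp.Pool(processes=n_workers) as pool:
        batches = pool.map_async(simulate_batch, tasks).get()
    return [value for batch in batches for value in batch]
```

`run_parallel(200, n_workers=4)` would then start four workers with 50 runs each. On platforms that spawn rather than fork worker processes, the call needs to sit under an `if __name__ == "__main__":` guard.
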
In general, though, you will be much better off doing your multithreading in a lower-level language that isn't limited by the GIL. For simple numerical expressions, such as your example, numexpr is a very simple way to achieve a significant performance boost (~4x, on an i7 CPU with 4 cores and hyperthreading). Besides implementing the parallel processing in C++, its more significant benefit is that it avoids allocating memory for intermediate results, and thus makes more efficient use of caching.
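
As a rough illustration of the intermediate-allocation point, the sketch below evaluates the same expression three ways. The expression `a * b + a` and the function names are made up for illustration. `numexpr` is treated as an optional third-party package, and the NumPy-only variant uses the `out=` argument of ufuncs as a stand-in technique that also avoids temporaries (it does not add numexpr's multithreading).

```python
import numpy as np

try:
    import numexpr as ne  # optional third-party package
except ImportError:
    ne = None


def combine_naive(a, b):
    """Plain NumPy: a * b is typically materialized as a temporary array."""
    return a * b + a


def combine_inplace(a, b, out=None):
    """NumPy only: write every step into one reusable output buffer."""
    if out is None:
        out = np.empty_like(a)
    np.multiply(a, b, out=out)  # out = a * b
    np.add(out, a, out=out)     # out = a * b + a, no extra array
    return out


def combine_fused(a, b):
    """Use numexpr when it is installed, else fall back to in-place NumPy."""
    if ne is not None:
        # numexpr evaluates the whole expression block by block.
        return ne.evaluate("a * b + a", local_dict={"a": a, "b": b})
    return combine_inplace(a, b)
```

In a loop of hundreds of runs, allocating the `out` buffer once and passing it to `combine_inplace` on every iteration avoids repeatedly allocating and freeing arrays of 500000 elements.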