I have two 4-core Intel CPUs, i.e. 8 cores.
I'm compiling Fortran 77 and 90 code with mpiifort -O3.
Overall I have six 3D arrays that constantly use each other's values to update themselves.
If their sizes are 100x100x100 (6,000,000 elements in total), the job takes 17 seconds on 1 CPU.
If I run the identical job on 2 CPUs it takes 21 seconds, and on 4 and 8 CPUs it takes 38 and 73 seconds respectively.
My goal here is that if 1 CPU does the job in 17 seconds while 7 CPUs sit idle, I'd like to see 8 such jobs done by 8 working CPUs in a time closer to 17 seconds.
I achieve this goal if I make my arrays small enough. With 10x10x10 arrays I get 11 seconds, 11.5 seconds, and 12.5 seconds using 1, 4, and 8 processors.
This makes me believe that if an array is small enough it can be stored close to the processor (maybe in cache), which makes the data easy to access, so each node doesn't have to transfer much data. If that's the case, is there any way to raise this limit (if it's a software limit) so that even my 100x100x100 arrays can be stored in the same spot?
If that's not the case, then what is the issue here?
Thanks
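For concreteness, here is a minimal sketch of the kind of update loop in question; the Jacobi-style stencil and all names are illustrative assumptions, not the actual code:

program update_sketch
  implicit none
  integer, parameter :: n = 100, nsteps = 50
  real*8 :: a(n,n,n), b(n,n,n)
  integer :: i, j, k, step
  a = 1.0d0
  b = a
  do step = 1, nsteps
     do k = 2, n-1
        do j = 2, n-1
           do i = 2, n-1
              ! each point is updated from its six neighbours
              b(i,j,k) = (a(i-1,j,k) + a(i+1,j,k) + a(i,j-1,k) &
                        + a(i,j+1,k) + a(i,j,k-1) + a(i,j,k+1)) / 6.0d0
           end do
        end do
     end do
     a = b
     ! under MPI, each rank would own a slab of the arrays and exchange
     ! boundary planes with its neighbours here, every step
  end do
end program update_sketch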
1 Solution
Quoting - pardisdesigner
OK, I'd like to update you on something.
I noticed that if I don't use any optimization (i.e., -O0 instead of -O3 or -O2), I can go as high as 100x100x100 for my arrays and see no slowdown, whether I run one 100x100x100 job on 1 CPU or eight 100x100x100 jobs on 32 CPUs.
It seems that the optimization is taking up some memory or cache space that my arrays could use. Is there any way to increase this space? Can you explain how and why the compiler behaves like this?
I'm awaiting a response.
Thank you
So at lower optimization you notice better scaling from 1 to 8 processors, correct? If so, I can explain this. I would guess that your code is communication bound. Consider processing speed (computation) versus communication. Let's fix one of these variables, the speed of MPI communication, and assume that we slow the processor way down. You can see that overall you will spend considerably more time in computation, and since communication is still fast, it is negligible compared to the computation. So as you scale the code, the time spent computing goes down directly with the number of processes. In the ideal case there is 100% computation and no communication, and you get linear speedup with scaling.
So for best scaling, your communication has to be small compared to computation.
On the other extreme, if you once again fix the communication speed and take the computation speed to infinity (making the code 100% communication bound), then scaling the code will bring no benefit; in fact, the additional processes will block more frequently and you will see negative scaling.
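A rough way to put numbers on this (my notation, not from the thread): write the runtime on p processes as

    T(p) ~ T_comp / p  +  T_comm(p)

where T_comp is the total computation time and T_comm(p) the time spent communicating. Optimization shrinks T_comp but not T_comm(p), and since T_comm(p) typically grows with the number of processes (more messages, more contention), the total can even rise with p, which matches the 17, 21, 38, and 73 second figures above.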
In your case, optimization speeds up the portion of the code that is computation, which again leaves you communication bound. What does Trace Collector/Analyzer show for your code? Perhaps a profile is in order.
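If Trace Collector/Analyzer isn't at hand, a crude first profile can be had by bracketing the compute and communication phases with MPI_WTIME. A minimal sketch, with placeholders where the real work and MPI calls would go:

program comm_profile
  use mpi
  implicit none
  integer :: ierr, rank, step
  double precision :: t0, t_comp, t_comm
  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  t_comp = 0.0d0
  t_comm = 0.0d0
  do step = 1, 100
     t0 = MPI_WTIME()
     ! ... local array updates go here ...
     t_comp = t_comp + (MPI_WTIME() - t0)
     t0 = MPI_WTIME()
     ! ... halo exchange / other MPI calls go here ...
     t_comm = t_comm + (MPI_WTIME() - t0)
  end do
  if (rank == 0) print *, 'computation:', t_comp, ' communication:', t_comm
  call MPI_FINALIZE(ierr)
end program comm_profile

If t_comm dominates at -O3 but not at -O0, that confirms the picture above.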
There is no way for users to manually allocate cache - the key is to roughly figure it out on your own.
You can try the options -opt-streaming-stores always and -opt-streaming-stores never to see if this makes any difference. Streaming stores are preferred for large data movement when the data will not fit in cache.
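For example (the source file name is illustrative):

  mpiifort -O3 -opt-streaming-stores always mycode.f90 -o run_always
  mpiifort -O3 -opt-streaming-stores never  mycode.f90 -o run_never

then compare the timings of the two binaries.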
ron
4 Replies
OK, I'd like to update you on something.
I noticed that if I don't use any optimization (i.e., -O0 instead of -O3 or -O2), I can go as high as 100x100x100 for my arrays and see no slowdown, whether I run one 100x100x100 job on 1 CPU or eight 100x100x100 jobs on 32 CPUs.
It seems that the optimization is taking up some memory or cache space that my arrays could use. Is there any way to increase this space? Can you explain how and why the compiler behaves like this?
I'm awaiting a response.
Thank you
Quoting - pardisdesigner
It seems that the optimization is taking up some memory or cache space that my arrays could use. Is there any way to increase this space? Can you explain how and why the compiler behaves like this?
Hi pardisdesigner,
Since this seems to be related mostly to the compiler optimizations, I'll go ahead and transfer your question to the Intel Fortran Compiler forum. The experts there should be able to help.
Regards,
~Gergana
If your 8 processors are configured as one SMP system, I suggest you consider OpenMP (not OpenMPI) as opposed to MPI. With MPI you are likely spending too much time copying memory and consuming your memory bandwidth.
>> I achieve this goal if I make my arrays small enough. With 10x10x10 arrays I get 11 seconds, 11.5 seconds, and 12.5 seconds using 1, 4, and 8 processors.
1/1000th of the data runs in only about 1/1.5th of the time (11 s vs. 17 s); hardly a bargain. Per element, that is roughly 650 times slower!
OpenMP can eliminate the message passing.
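A minimal sketch of this approach, again assuming a stencil-style update (the loop body and names are illustrative, not the actual code); compile with something like ifort -O3 -openmp:

program omp_update
  implicit none
  integer, parameter :: n = 100
  real*8 :: a(n,n,n), b(n,n,n)
  integer :: i, j, k
  a = 1.0d0
  b = a
  ! the outer k loop is split across threads; all threads share a and b
  !$omp parallel do private(i,j)
  do k = 2, n-1
     do j = 2, n-1
        do i = 2, n-1
           b(i,j,k) = (a(i-1,j,k) + a(i+1,j,k) + a(i,j-1,k) &
                     + a(i,j+1,k) + a(i,j,k-1) + a(i,j,k+1)) / 6.0d0
        end do
     end do
  end do
  !$omp end parallel do
end program omp_update

The threads work in a single address space, so nothing is copied between processes.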
6 arrays * 1,000,000 elements * 8 bytes (real*8?) = 48 MB.
This can easily fit in a 32-bit machine.
Jim
