Solved: Re: MPI Coding

kmvinoth · ‎10-19-2009

Hi, I am doing MPI coding in C++ using the Intel compiler on a Linux cluster. When I compile my program I get the message "remark: Partial loop was vectorized". After googling for some time I put #pragma novector before the loop, which got rid of that message. But when I run the MPI program it does not give me the expected answer: only certain processors in the nodes do work while the others do not. The output is given below. I hope someone can help me out. Thanks in advance.

Q from process 0 = 0.000000e+00
Q from process 2 = 0.000000e+00
Q from process 6 = 0.000000e+00
Q from process 4 = 0.000000e+00
Q from process 7 = 2.684355e+08
Q from process 1 = 2.684355e+08
Q from process 5 = 2.684355e+08
Q from process 3 = 2.684355e+08
Total Q = 1.073742e+09

Regards
vinoth

jimdempseyatthecove · ‎10-19-2009

Vinoth,

Check for uninitialized/unused variables (either input or output). If fewer than 8 threads are required to perform the work, then you might see junk or NULL as the result for an uninitialized/unused variable.

Jim Dempsey

kmvinoth · ‎10-19-2009

Jim

It's not clear to me what you said. I have also checked for uninitialized/unused variables in my code and there are none. Apart from that, it is always the even-numbered processors (0, 2, 4, 6) that give zero as the answer, while the odd-numbered ones give some answer. Hope I made it clear.

jimdempseyatthecove · ‎10-20-2009

I will type this slower so you can read this more easily...

Assume you have a parallel distribution point (e.g. a parallel for) that is distributed across multiple machines using OpenMPI. However, also assume the iteration space for the parallel distribution point has fewer distributions than you have systems (threads if viewing as OpenMP). What do you expect for the results from the processors NOT scheduled?

Also, your configuration may be set up such that for processors with HT (Hyper-Threading), only one of the siblings is scheduled. IOW if your iteration space is larger than the number of "processors", only half get scheduled.

Therefore, if your setup is for each processor in your "system of systems" to write back results to a mailbox, one slot per "processor" (HW thread), then for the processor(s) (thread(s)) NOT scheduled, you will have no delivery into the mailbox. IOW the mailbox will have stale data (0, an old value, or uninitialized data).

Jim Dempsey

kmvinoth · ‎10-20-2009

hi Jim

I understood what you said, but it has not solved my problem. My question is how to overcome that issue and get rid of the problem.

Regards
vinoth

jimdempseyatthecove · ‎10-21-2009

Vinoth,

Insert code to have each process write diagnostic information that helps track down the problem, e.g.

Are all the processes actually called?
Is each process called with the input data you assume it is being called with?
Are the results produced within the process (and diagnostically stored) the results you expect?
Are (all of) the results produced within the process returned to the controlling process?
Is the controlling process waiting for all of the results?

When you discover the problem I anticipate a "Eureka moment".

Good luck hunting. I am sure it is just a small oversight.

Jim

kmvinoth · ‎10-22-2009

jim

I have given my code for your reference to track down the problem. Each processor should give the answer 8589934592, and the final value of Q (the total partition function) should be 6.871947674*10^10. I am also trying to discover the problem.

jimdempseyatthecove · ‎10-22-2009

Not knowing the systems in your cluster...

Are the even numbered systems 32-bit?

# define N 36
# define SIZE pow(2,N)
...
int i,n,p,tp,sv,ev;
int a[36];
n=(int)SIZE; // *** overflows on 32-bit system

Jim Dempsey

jimdempseyatthecove · ‎10-22-2009

??
Was this code originally written on a 36-bit PDP-10?

Jim Dempsey

kmvinoth · ‎10-23-2009

jim

No, I don't know what a 36-bit PDP-10 is. I wrote the code for 2^25 and it runs well without any problem. Based on that, I went to 2^36, where the problem (Partial loop was vectorized) occurs.

Let me explain about the cluster (Vega Cluster, hpce.iitm.ac.in/website/vega.html) where I am running my code. We have 8 processors in each node, each processor is 64-bit, and I am using the 64-bit Intel C++ compiler. What more does my code need to run successfully on the cluster?

Regards
vinoth

jimdempseyatthecove · ‎10-23-2009

What is sizeof(int) on each system? If sizeof(int)==4 then 2^36 will not fit in an int variable.

Change the type of your indexing variable(s) and the variables used for limits/counts to intptr_t (an int whose size is that of a pointer).

BTW PDP-10 was built by Digital Equipment Corporation back in the late 1960's. These machines had 36-bit word size.

Jim Dempsey

jimdempseyatthecove · ‎10-23-2009

I forgot to add that your code may have worked by accident, depending on the contents of the data immediately following the indexing variables of your loop.

Jim

TimP · ‎10-23-2009

Quoting - jimdempseyatthecove

BTW PDP-10 was built by Digital Equipment Corporation back in the late 1960's. These machines had 36-bit word size.

36-bit machines, such as Honeywell 6000, in the early 80's, had 9-bit bytes, so sizeof(int) == 4. Most C programmers had moved on to 32-bit machines by then. We didn't even have a C compiler for it.

kmvinoth · ‎10-24-2009

Jim

Thank you very much for your useful suggestions over the past week on this problem. I changed the data type of the variables as you suggested and it works well now (really a eureka moment). I have one more question: what is the maximum number I can go to? Can I go for 2^256?

Regards
vinoth

jimdempseyatthecove · ‎10-26-2009

On systems with 64-bit words, an unsigned 64-bit integer can hold values up to 2^64-1. A signed 64-bit integer ranges from a maximum of +2^63-1 down to a minimum of -2^63.

If you wish to express integers of larger size you can create a class with operators for this purpose. There are some available on the web. Here is one link http://www.codeproject.com/KB/cpp/lint.aspx

How do you intend to use a number (variable) containing 2^256?
If you are only interested in the power that 2 is raised to, then consider holding the n of 2^n instead of the result.

Jim Dempsey

kmvinoth · ‎10-27-2009

jim

What you are saying is not clear to me. It would be very useful if you could explain it with an example.

Regards
vinoth

jimdempseyatthecove · ‎10-27-2009

Vinoth,

In an 8-bit system

00000001 = 2^0 = 1
00000010 = 2^1 = 2
00000100 = 2^2 = 4
00001000 = 2^3 = 8
00010000 = 2^4 = 16
00100000 = 2^5 = 32
01000000 = 2^6 = 64
10000000 = 2^7 = 128 (unsigned) or -128 (signed)
11111111 = 2^7+2^6+2^5+2^4+2^3+2^2+2^1+2^0 = 255 (unsigned) or -1 (signed)

A 16-bit system would extend this by 8 more bits,
A 32-bit ... more bits
A 64-bit ... more bits

If you only need to do something 2^N times, you can carry N around rather than the value 2^N:

[cpp]void DoSomething()
{
  // ... your code to do something
}
// Do something 2^N times
void DoSomething(int N)
{
  if(N > 0)
  {
    DoSomething(N-1);  // first half: 2^(N-1) calls
    DoSomething(N-1);  // second half: 2^(N-1) calls
  }
  else
    DoSomething();     // N == 0: 2^0 = 1 call
}[/cpp]

Jim Dempsey

kmvinoth · ‎10-28-2009

jim

I got it. Thank you for your explanation, Jim.

regards
vinoth