I want to benchmark an implementation on cluster architectures. The number of processes need not be high, but I need as much memory as possible for each MPI process. For example, I have access to a large cluster where each node has two sockets and each socket has 6 multithreaded cores. Assume that I launch one MPI process per node (or per socket). Can I make this particular MPI process access the entire node's (or socket's) memory? Right now I can launch a single MPI process per socket, but the amount of memory the process sees is only that of a single core.
I wonder why you get this strange behavior with MPI on your nodes. A single MPI process should be able to access the whole memory of the node unless you restricted it yourself, for example with numactl. Do you have a reproducer for this behavior? How much memory were you able to allocate, and how much memory was available (as shown by, e.g., top)?