For Option (1)
MKL gets its parallelism from threading, and threads only run within a single shared-memory machine. If the machine is an SMP box, the job will scale up to the number of cores that machine has. In the end your job runs on one machine with number of threads = number of cores, so the cluster does not help you get better performance here.
For Option (2)
As you are using ScaLAPACK and MPI here, the job will spread across multiple machines of the cluster. If the machines are SMP boxes, you can combine MPI with threading to get optimal performance. For example, if the cluster has 5 machines and each of them is a quad-core SMP box, then you can run the job with
(a) 5 MPI processes on 5 machines and 4 MKL threads per MPI process. Each machine will have 1 MPI process & 4 MKL threads.
(b) 10 MPI processes on 5 machines and 2 MKL threads per MPI process. Each machine will have 2 MPI processes with 2 MKL threads each, i.e. 4 MKL threads in total.
Use the -machinefile option of mpirun/mpiexec to control the number of MPI processes on each machine, and set MKL_NUM_THREADS=<4 or 2> to control the number of MKL threads per MPI process.
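To make the arithmetic behind (a) and (b) concrete, here is a small Python sketch that lists the process/thread combinations that use every core of a quad-core machine exactly once and prints the matching settings. The host names (node1 ... node5) and the executable name ./my_app are made up for illustration. The machinefile lines use the "host:count" form accepted by MPICH-style launchers; Open MPI writes "host slots=count" instead, so check what your MPI expects.

```python
def layouts(cores_per_machine):
    """Return (MPI processes per machine, MKL threads per process) pairs
    whose product equals the number of cores on one machine."""
    return [(procs, cores_per_machine // procs)
            for procs in range(1, cores_per_machine + 1)
            if cores_per_machine % procs == 0]


def machinefile(hosts, procs_per_machine):
    """Build machinefile text in "host:count" form (an assumption;
    Open MPI uses "host slots=count" instead)."""
    return "\n".join(f"{host}:{procs_per_machine}" for host in hosts)


if __name__ == "__main__":
    hosts = [f"node{i}" for i in range(1, 6)]  # hypothetical host names
    for procs, threads in layouts(4):          # 4 = cores per machine
        total = procs * len(hosts)
        print(f"--- {total} MPI processes, {threads} MKL thread(s) each ---")
        print(machinefile(hosts, procs))
        # ./my_app is a placeholder for your ScaLAPACK executable
        print(f"MKL_NUM_THREADS={threads} mpirun -machinefile machines "
              f"-np {total} ./my_app")
```

For a quad-core machine this lists layouts (a) and (b) above, plus a pure-MPI layout with 4 processes per machine and 1 MKL thread each, which you may also want to time. Depending on your MPI, MKL_NUM_THREADS may need to be passed to the remote processes explicitly rather than only set in the launching shell.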
So I suggest you go with Option (2) and evaluate the performance of both (a) and (b).