That's a big question. I would suggest starting with MKL's internal parallelism.
In most cases, MKL is already tuned for good parallel performance on multi-core systems, based on your system configuration and problem size. If you link against the threaded MKL library, your application is parallelized automatically.
For example, you could try PARDISO first and see how the performance changes with export MKL_NUM_THREADS=1/2/4/8, building with the command
> ifort -mkl your.f90
(I'm not sure how MPI processes interact with the MKL threads, which are based on OpenMP.)
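The thread-count experiment suggested above can be scripted. This is only a minimal sketch, assuming bash (for its `time` keyword and `TIMEFORMAT`); the function name `sweep_mkl_threads` and the executable name `./a.out` are placeholders, not anything MKL provides.

```shell
#!/usr/bin/env bash
# Sketch: run one command under several MKL thread counts and time each run.
# The thread counts are the ones suggested above; the command is whatever
# you pass in (./a.out in the usage note is only a placeholder).
sweep_mkl_threads() {
    local n
    local TIMEFORMAT='  elapsed %R s, cpu %U s user + %S s sys'
    for n in 1 2 4 8; do
        echo "MKL_NUM_THREADS=$n"
        # env sets the variable for this single run only
        time env MKL_NUM_THREADS="$n" "$@"
    done
}
```

After building with ifort -mkl your.f90, a call such as `sweep_mkl_threads ./a.out` prints one timing line per thread count. If the elapsed time stops dropping as the count grows, the problem is probably too small to benefit from more threads.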
Then, if you really need to parallelize your application yourself, you will need to learn the various parallel programming methods, typically OpenMP and pthreads on Linux, plus the threaded MKL library (-lmkl_intel_thread -lmkl_core -liomp5).
You can search the forum or the MKL User's Guide. Here is one document about this for your reference.
People who are interested in cpu_time for parallel benchmarks usually consider an increase as a favorable result, using it along with the elapsed time (e.g. from system_clock) to calculate "concurrency" (the ratio of cpu time to elapsed time).
The new compiler feature !$omp parallel do simd is particularly hoggish in terms of making big increases in CPU time, on the assumption that enough threads will be used to make a reduction in elapsed time.
Hyperthreading enthusiasts don't always care even about a reduction in elapsed time; they simply like to see a large concurrency figure.
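For a quick look at the concurrency ratio without instrumenting the Fortran source with cpu_time and system_clock, the same figure can be approximated from the command line. This is a rough sketch, assuming bash; the function name `concurrency` is made up for illustration, and it uses the shell's own timing of the whole process rather than the Fortran intrinsics mentioned above.

```shell
#!/usr/bin/env bash
# Sketch: concurrency = (user + system CPU time) / elapsed time for one command.
# Assumes a locale that prints '.' as the decimal separator.
concurrency() {
    local TIMEFORMAT='%R %U %S'   # elapsed, user CPU, system CPU (seconds)
    local timing
    # bash's time keyword reports on stderr; the command's own output is discarded
    timing=$( { time "$@" >/dev/null 2>&1; } 2>&1 )
    echo "$timing" | awk '{
        if ($1 > 0) printf "%.2f\n", ($2 + $3) / $1
        else        print "n/a (run too short to time)"
    }'
}
```

For example, `MKL_NUM_THREADS=4 concurrency ./a.out` (with ./a.out again a placeholder) prints a value near 4 when all four threads were kept busy for the whole run. As noted above, busy is not the same as faster, so compare the elapsed times as well.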