Your code shows that you are using different threads to initialize the arrays than the threads you use to do the computation. Why not parallelize the outer initialization loop as well, so that each thread first touches the data it will later operate on? On a NUMA system, first-touch page placement puts each page in the memory of the node whose thread initializes it, and the data also starts out in at least the same last-level cache it will be operated on from.
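A minimal sketch of what I mean (the array names a and b and the size n are assumptions, since I don't have your exact code): initialize with the same static schedule you use for the compute loops.

```fortran
program first_touch
  implicit none
  integer, parameter :: n = 10000000
  real(8), allocatable :: a(:), b(:)
  integer :: i
  allocate(a(n), b(n))
  ! First-touch initialization: with the default static schedule each
  ! thread touches the same iteration range it will later compute on,
  ! so the pages are placed in that thread's NUMA node.
!$omp parallel do schedule(static)
  do i = 1, n
     a(i) = 0.0d0
     b(i) = 0.0d0
  end do
!$omp end parallel do
  ! ... compute loops using the same schedule(static) go here ...
  deallocate(a, b)
end program first_touch
```

The key point is that the initialization loop and the compute loops use the same schedule, so iteration i is always handled by the same thread.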
I also have an answer for one of your questions:
2) Why when compiling F90 code with OpenMP, are arrays not initialized in parallel?
Because it is typically not possible for the compiler to tell how the data should be split up among the threads on a NUMA system. (For your VERY simple example it may be possible, but not for large applications in general.) The programmer does know this, however, and can do a good job by initializing the data in a parallel do with the same (default static) scheduling as the compute loops. This is the shared-memory equivalent of the explicit data distribution you have to do yourself in MPI.
Finally, why not combine the parallel do loops that do the calculation of a() and b() into a single parallel region? (See !$omp parallel and !$omp do. I'm suggesting one "!$omp parallel" region with two "!$omp do" loops inside, so the thread team is created only once.)
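Something like this (the loop bodies are placeholders, since I don't know your actual computation):

```fortran
program one_region
  implicit none
  integer, parameter :: n = 1000000
  real(8) :: a(n), b(n)
  integer :: i
  ! One parallel region: the thread team is created once and reused by
  ! both worksharing loops, instead of paying the fork/join cost twice.
!$omp parallel
!$omp do schedule(static)
  do i = 1, n
     a(i) = sqrt(dble(i))     ! placeholder computation
  end do
!$omp end do
  ! Implicit barrier here at the end of the first do construct;
  ! needed in this sketch because the second loop reads a().
!$omp do schedule(static)
  do i = 1, n
     b(i) = a(i) + 1.0d0      ! placeholder computation
  end do
!$omp end do
!$omp end parallel
end program one_region
```

If the two loops were independent, you could add a nowait clause to the first !$omp end do and remove that barrier as well.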
If you do these simple things, you may get a bit more speedup compared to the MPI version (assuming you are getting some cache re-use already). But if you are truly bandwidth limited, the only way to break through that performance barrier is to make better re-use of the caches. This often involves careful algorithm restructuring and requires significant amounts of work (like MPI).