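Please ensure you use vdPowx (rather than vdPow). vdPowx is intended for raising vector elements to a constant power (0.23 in your case), so it takes the exponent as a scalar, whereas vdPow needs a second input vector holding one exponent per element. That should significantly reduce both the memory footprint and the pressure on the memory subsystem.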
I think high pressure on the memory subsystem is the main reason you see worse performance. There was a good suggestion to split the input/output vectors into chunks so that the results fit into the cache. (Chunks of a few thousand elements should be fine.)
The compiler's default math library accuracy is equivalent to MKL VML_LA. If you use vdPowx with VML_LA plus vector blocking, I would expect the MKL VML performance to be at least on par with what you see in Fortran.