My code is dominated by three intrinsics: TRANSPOSE, MAXVAL, and NORM2. The arguments are large arrays/vectors, and I have 36 or more cores at my disposal.
First, am I correct in assuming these are potentially vectorized but not threaded by default?
I am considering writing my own replacements for these with nested loops and applying appropriate OMP PARALLEL and OMP SIMD directives. However, it would be nice to find existing threaded versions of these. In MKL, maybe?
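
For concreteness, here is a minimal sketch of the kind of hand-written replacements I have in mind. It assumes double-precision data, rank-1 arguments for the two reductions, and a compiler supporting OpenMP 4.0 or later. The module and procedure names (par_intrinsics, par_maxval, par_norm2, par_transpose) are hypothetical and not from any library.

```fortran
! Hypothetical threaded stand-ins for MAXVAL, NORM2 and TRANSPOSE.
! Build with OpenMP enabled (e.g. gfortran -fopenmp); without it the
! directives are treated as comments and the loops run serially.
module par_intrinsics
  use, intrinsic :: iso_fortran_env, only: dp => real64
  implicit none
  private
  public :: par_maxval, par_norm2, par_transpose
contains

  ! Maximum of a vector via an OpenMP max reduction.
  function par_maxval(x) result(m)
    real(dp), intent(in) :: x(:)
    real(dp) :: m
    integer :: i
    m = -huge(1.0_dp)          ! same result as MAXVAL for a zero-size array
    !$omp parallel do simd reduction(max:m)
    do i = 1, size(x)
      m = max(m, x(i))
    end do
  end function par_maxval

  ! Euclidean norm via a sum reduction.
  ! Assumption: plain sum of squares, with no scaling against
  ! overflow/underflow, so extreme inputs may behave differently from NORM2.
  function par_norm2(x) result(r)
    real(dp), intent(in) :: x(:)
    real(dp) :: r, s
    integer :: i
    s = 0.0_dp
    !$omp parallel do simd reduction(+:s)
    do i = 1, size(x)
      s = s + x(i) * x(i)
    end do
    r = sqrt(s)
  end function par_norm2

  ! Out-of-place transpose; b must have shape (size(a,2), size(a,1)).
  ! Threads split the columns of a; the inner loop reads a contiguously
  ! but writes b with a stride.
  subroutine par_transpose(a, b)
    real(dp), intent(in)  :: a(:,:)
    real(dp), intent(out) :: b(:,:)
    integer :: i, j
    !$omp parallel do private(i)
    do j = 1, size(a, 2)
      !$omp simd
      do i = 1, size(a, 1)
        b(j, i) = a(i, j)
      end do
    end do
  end subroutine par_transpose

end module par_intrinsics
```

The reductions may differ from the intrinsics in the last bits because the summation order changes with the thread count. For large matrices the strided writes in the transpose are probably the bottleneck, so a cache-blocked variant would likely be needed before this is competitive.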