From my measurements, it looks like the AVX2 auto-dispatch code path is not called when I build a shared (dynamic) library on macOS, even though the resulting code size clearly differs with and without -axCORE-AVX2 at compile time. Auto-dispatch definitely works when I build standalone applications. I'm using a MacBook Pro (Early 2015), whose processor supports AVX2, with icc 17.0.4. Is there a specific switch I need to enable auto-dispatch in dynamic libraries? I link the dynamic library with g++ because the project mixes in Objective-C. However, the standalone application is also linked with g++, and auto-dispatch works there.
Maybe this isn't a compiler issue after all. I compiled the whole application with -xCORE-AVX2, without auto-dispatch, and there is no performance benefit compared to an -xSSSE3 build. So it's probably processor-related: this CPU doesn't seem to execute the AVX2 code fast enough to make a difference. For comparison, I compiled similar code on Windows 10 with AVX2 auto-dispatch, and on an Intel i7-7700K it is around 40% more efficient than the SSE3 code.