The fastest and simplest combination I have found is -Ofast -static. I also tried -mtune=haswell and Profile Guided Optimization, but neither improved the computation time.
Is there anything else I can do in terms of compilation options?
You can try -fast as a simple shortcut. It sets several options that usually (but not always) improve computation performance. Be aware that it implies -xHost, which optimizes for the CPU you compile on. If you'll be running the program on other machines, don't use either -fast or -xHost.
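As a rough sketch of the two cases, assuming the classic Intel compiler driver (icc) on Linux: as far as I know, -fast there expands to roughly -ipo -O3 -no-prec-div -static -fp-model fast=2 -xHost, so compared with your -Ofast -static it mainly adds interprocedural optimization and host-specific code generation. The file and output names (main.c, app_host, app_portable) are placeholders I made up, and -axCORE-AVX2 is just one possible choice of target; check your compiler version's documentation before relying on any of this.

```shell
# Case 1: the binary only runs on the machine it is built on.
# -fast implies -xHost, so the code is tuned for the build host's CPU.
icc -fast -o app_host main.c

# Case 2: the binary will run on other machines.
# Skip -fast and -xHost; keep the other optimizations and name the target yourself.
# -axCORE-AVX2 (an assumed choice) adds an AVX2 code path next to a baseline path.
icc -Ofast -ipo -static -axCORE-AVX2 -o app_portable main.c
```

Time both builds against your current -Ofast -static binary. Since -fast only "usually" helps, the measurement is what decides it.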