You need -ffast-math for gcc to vectorize a sum reduction; with it, gcc ought to handle this quite well. gcc 4.6 cleaned up the list of aggressive optimizations enabled under fast-math, so it is safer than the similar optimization with icc or older gcc. If the compiler doesn't automatically perform scalar replacement on *s (it should, if you use __restrict pointers), it's simple enough to write that in your source code. As for the ridiculously short time: if the compiler can see that you never use the result of a loop, it may optimize the loop away. This kind of benchmark-cheating optimization has been in high demand for decades.
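A minimal sketch of the scalar replacement described above, since the loop in question is not reproduced here. The names `sum_into` and `demo_sum`, the `float` element type, and the array size are assumptions for illustration. The accumulation runs in a local variable and is stored to `*s` once after the loop.

```c
#include <stddef.h>

/* Hypothetical reduction: sum_into and its signature are illustrative,
   not the code from the original question. */
void sum_into(float *__restrict s, const float *__restrict a, size_t n)
{
    float acc = 0.0f;                 /* scalar replacement of *s */
    for (size_t i = 0; i < n; ++i)
        acc += a[i];
    *s = acc;                         /* single store after the loop */
}

/* Hypothetical driver: fills an array and returns the sum so that a
   caller can consume it. */
float demo_sum(void)
{
    enum { N = 1024 };
    static float a[N];
    for (size_t i = 0; i < N; ++i)
        a[i] = 1.0f;

    float s;
    sum_into(&s, a, N);
    return s;
}
```

In a timing test, print or otherwise use the value returned by `demo_sum()`; if the result is never used, the compiler is free to drop the loop, which is one way to end up with a ridiculously short time.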
I was assuming you were compiling with an SSE option and asking gcc to vectorize (-O3). With -ffast-math -O3, gcc auto-vectorizes reductions such as the one you posted, so you would get within 60% of the best possible performance for the loop without changing your C source code. The gcc option -ftree-vectorizer-verbose=n (n >= 1) will give you some vectorization diagnostics. For the source code you posted, this is equivalent to icc's auto-vectorization with -fast or #pragma simd reduction(), except that icc will unroll more aggressively to get more performance in the middle range (loop length 100-2000).
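The reason a reduction needs -ffast-math is that a vectorized sum keeps several partial sums and combines them at the end, which changes the order of the floating-point additions. A hand-written sketch of roughly that transformation follows, assuming 4 float lanes (one SSE register). The name `sum_4lanes` is hypothetical, and the code the compiler actually generates will differ in detail.

```c
#include <stddef.h>

/* Hypothetical illustration of a 4-lane reduction; not compiler output. */
float sum_4lanes(const float *a, size_t n)
{
    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};   /* one partial sum per lane */
    size_t i = 0;

    /* Main loop: each lane accumulates every 4th element. */
    for (; i + 4 <= n; i += 4) {
        lane[0] += a[i];
        lane[1] += a[i + 1];
        lane[2] += a[i + 2];
        lane[3] += a[i + 3];
    }

    /* Combine the partial sums once, after the main loop. */
    float sum = (lane[0] + lane[1]) + (lane[2] + lane[3]);

    /* Remainder loop for the last n % 4 elements. */
    for (; i < n; ++i)
        sum += a[i];

    return sum;
}
```

Because the additions are grouped differently, the result can differ from the strictly sequential sum in the last bits, so the compiler will not make this change on its own under default floating-point rules.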