I'm guessing this is more of a compiler options question than a threading question. Since you haven't said which compiler you are using, the following may be more verbose than you wanted.

One issue I run into with C float (Fortran default real) is that the aggressive default optimizations of the Intel compilers are more likely to bite in SSE code, because intermediate calculations are not promoted to higher precision. For C code, you may need one of the /fp (Linux: -fp-model) options that is less aggressive than the default /fp:fast. For ifort, I often use -assume protect_parens -prec-div -prec-sqrt, all of which are included in -fp-model precise. The default optimizations reduce the range of validity of divide and sqrt; -prec-div and -prec-sqrt require those operations to be done according to IEEE, subject to whether you have gradual underflow enabled (/fp:precise sets /Qftz-).

/fp:precise does not promote Fortran default real intermediates to double, but it does promote C float intermediates to double. It also removes optimizations whose numerical results may depend on data alignment. I hope I have not confused the issue beyond what is written in the compiler docs.

With Microsoft 32-bit C, if you set /arch:SSE2, you don't get good float performance unless you also set /fp:fast (or possibly /fp:source), because the implicit float-to-double conversions of /fp:precise (their default) are expensive. So good performance with SSE comes at the expense of the protection offered by extended range and precision evaluation.

gcc -ffast-math (roughly equivalent to the Intel compilers' -fp-model fast) has different reliability issues between x87 and SSE code.