Please read the document I referred to above. You're getting these results because Atom is terrible at x87, not because Sandy Bridge is bad at SSE.
You're only getting a 3% speedup with double-precision SSE on Sandy Bridge apparently because of vectorization overhead. Typical code requires a significant number of shuffle, insert and extract instructions to organize the data before and after the packed arithmetic. Future architectures may alleviate this by adding gather/scatter support (i.e. packed load/store which doesn't occupy port 0/1/5 slots).
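
To make the overhead concrete, here is a minimal sketch in C with SSE2 intrinsics. It is not taken from your benchmark; the function names and the indexed-sum workload are made up for illustration. Since there is no gather instruction, each two-element vector has to be assembled from two separate scalar loads before a single packed add can be issued, and the final reduction needs another shuffle-type instruction.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Scalar reference: sum of table[idx[i]] for i in [0, n). */
double gather_sum_scalar(const double *table, const int *idx, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += table[idx[i]];
    return sum;
}

/* SSE2 version (hypothetical example): one packed add per two elements,
   but every vector is first built from two scalar loads. */
double gather_sum_sse2(const double *table, const int *idx, size_t n)
{
    __m128d acc = _mm_setzero_pd();
    size_t i = 0;

    for (; i + 1 < n; i += 2) {
        /* Load element i into the low lane (upper lane zeroed). */
        __m128d v = _mm_load_sd(&table[idx[i]]);
        /* Insert element i+1 into the high lane. */
        v = _mm_loadh_pd(v, &table[idx[i + 1]]);
        /* The only instruction doing "useful" packed arithmetic. */
        acc = _mm_add_pd(acc, v);
    }

    /* Horizontal reduction: move the high lane down, then add. */
    __m128d hi = _mm_unpackhi_pd(acc, acc);
    double sum = _mm_cvtsd_f64(_mm_add_sd(acc, hi));

    /* Scalar tail for an odd element count. */
    for (; i < n; ++i)
        sum += table[idx[i]];
    return sum;
}
```

In the loop above, two of the three instructions per iteration only move data into place, which is the kind of overhead that can eat most of the theoretical 2x gain from packed doubles. With contiguous data a single packed load would replace both of them.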