If turning on HT reduced your performance when running the same test with 2 threads, it doesn't look like a false sharing issue. To get an advantage from HT, you usually need to increase the number of threads to match the number of logical processors. A significant reduction in performance is more likely to be a scheduling problem. I don't know whether schedulers that work better with HT on dual-CPU systems are likely to come with distros incorporating 2.6 kernels.
As Tim pointed out, if you kept the same two threads when running under HT, the OS may have scheduled both threads onto the same physical processor (that is, onto its two logical HT processors). The two threads would then share one processor's execution resources while the other physical processor sat idle, which would result in a performance drop compared to the dual-processor test without HT. Have you tried running this with four threads on a dual-processor, HT-enabled system?
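If you want to check which logical processors share a physical processor, and pick one logical processor per physical processor for a 2-thread run, something along these lines might help. This is only a rough sketch: the helper names (`group_by_core`, `one_cpu_per_core`, `read_linux_topology`) are made up for illustration, and it assumes a Linux system that exposes `physical_package_id` and `core_id` under `/sys/devices/system/cpu/cpuN/topology/`, which older kernels may not do.

```python
import os
from collections import defaultdict


def group_by_core(topology):
    """Group logical CPUs by the physical core they belong to.

    topology maps logical CPU number -> (package_id, core_id).
    Returns {(package_id, core_id): [logical CPUs, ascending]}.
    """
    cores = defaultdict(list)
    for cpu, core_key in sorted(topology.items()):
        cores[core_key].append(cpu)
    return dict(cores)


def one_cpu_per_core(topology):
    """Pick the lowest-numbered logical CPU from each physical core."""
    return sorted(cpus[0] for cpus in group_by_core(topology).values())


def read_linux_topology(base="/sys/devices/system/cpu"):
    """Read (package_id, core_id) per logical CPU from sysfs.

    Assumption: Linux with the sysfs topology files present.
    CPUs whose files cannot be read are skipped, so the result may be empty.
    """
    topology = {}
    for cpu in range(os.cpu_count() or 1):
        try:
            with open(f"{base}/cpu{cpu}/topology/physical_package_id") as f:
                package_id = int(f.read())
            with open(f"{base}/cpu{cpu}/topology/core_id") as f:
                core_id = int(f.read())
        except (OSError, ValueError):
            continue
        topology[cpu] = (package_id, core_id)
    return topology


if __name__ == "__main__":
    topo = read_linux_topology()
    print("logical CPUs per physical core:", group_by_core(topo))
    print("one logical CPU per core:", one_cpu_per_core(topo))
```

With the list from `one_cpu_per_core`, each of the two threads could be pinned to a different physical processor (on Linux, for example, with `os.sched_setaffinity` or the `taskset` command), which takes the scheduler's placement out of the comparison. For the four-thread run, all logical processors would be used instead.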