There are two areas that will cause this effect.
1) Memory access latencies.
An HT P4 has only one memory port, so to speak. If the processor is not making good use of its cache hierarchy, then both threads end up waiting on memory I/O.
A P4 with HT has only one FPU (Floating Point Unit), but has two integer units and address-calculation units.
If the two threads are predominantly FPU bound, then you are hitting an FPU bottleneck.
On my P4 with HT, running a Finite Element simulation of a tether system depicting a Space Elevator, I see a 20%-30% boost in performance when running two threads vs. one thread. (Your mileage may vary.) On my 4-core server box (2x AMD Opteron 270s with two memory buses) I can attain over a 350% boost in performance.
HT was a bit disappointing for me in terms of ROI on the code change from single thread to multi-thread. Going to a multi-core and/or multi-processor package is where your best ROI is found. Think of retiring your P4 with HT and replacing it with a dual-core or quad-core processor.
Oh, I hope this doesn't post multiple times. Another post eaten by forum software.
I would be surprised if the problem weren't (1) above, i.e. memory access throughput (I assume you misspoke above, because latency-limited code is exactly what HT is supposed to improve). The original poster specified one array rather than several arrays, and his method of timing was likely coarse-grained, so one might expect that his data set was big enough to reside in main memory rather than in L1 cache.
If the code has an FPU latency bottleneck rather than an FPU throughput bottleneck, HT can act almost like two real processors. Consider this table of results. For the form of the benchmark where the FPU was used, the computational pipelines were filled with bubbles, so HT shows quite an improvement. The SSE2 form of the benchmark sees less than 50% improvement because of the low throughput of P4s, due to their 64-bit-wide computational units. I have a 64-bit version of the same (Mandelbrot) benchmark, and I believe that as a P4 progresses from the more latency-limited (*X1*) versions to the throughput-limited (*X4*) versions it would show less and less advantage for HT, but alas there are no x64 machines on display at the store, so I will never be able to test this hypothesis.
If the original poster were to rewrite his tests so that the array in question were small enough to fit in L1 cache, did a preliminary sweep so that all code and data were in fact in L1 data cache and trace cache, and changed his timing methodology so that a fine-grained timer such as the TSC were used, I am confident that the code produced by a C++ compiler for this exercise would create so many pipeline bubbles (a pipeline froth, in fact) that HT would be almost twice as fast. I would like to see a disassembly of the inner loop if this were not the case.
HT can show big gains in dealing with certain types of memory latencies, particularly TLB misses. Those occur when instructions and data are scattered in memory; nothing was said to indicate that was so in this case. If, instead, it is a question of sufficient memory bandwidth to support multiple threads, the average latency will actually increase with the number of threads.
HT may show a gain when there is a mixture of arithmetic operations, including some which have a long latency, such as divide. As HT doesn't provide any additional FPU resources, it doesn't increase the ideal peak computation rate of effectively vectorized code.
Rather than gauge how my application will run on HT by viewing benchmark comparisons, I chose to gauge how it runs on HT by actually converting the application and running it. On a 4-core system without HT, the multithreaded code runs at ~350% of one core on the same system. I have not run 2 cores on this system, but my unscientific gut feel is that 2 cores would run at ~175% (possibly more). My desktop system, with one processor with HT and 2 threads, gets ~30% (or less) better performance than with one thread.
Depending on how the simulation is set up, it may have 1 GB or more of data that it manipulates. Some of the simulation data will fit in the cache (local temps), but the main arrays will likely not fit in cache.
For this real, live application, HT was somewhat of a disappointment. The code is fairly well optimized, as indicated by a 4-core (4-thread) run attaining ~350% improvement over one core. While it may be true that you can find a benchmark that attains a 4x improvement with 4 cores, real-world applications typically will not scale in a linear fashion.