I'm trying to speed up FFT processing on a dual 2 GHz Pentium 4 Xeon. When I use FFTW, which is fast C code available on the web, it takes ~1 second to compute 20000 1k FFTs. It takes ~0.75 seconds using MKL 5.2 or 6.0, and ~0.5 seconds using IPP. Why isn't the Intel microcode faster? Is it simply that their code has not been optimized for Xeon processors yet? It is disappointing to get at best a factor of 2 from parallelized microcode. Also, IPP doesn't seem to support either Fortran calls or parallelization, both of which I need. I asked Intel's Premier support about this a couple of weeks ago but haven't received an explanation. In general, how many FLOPS should you get per clock cycle on a Xeon? Isn't it 4, or 8 GFLOPS per processor? I'm certainly not seeing this using their microcode.
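To put those timings on the same scale as a GFLOPS figure, here is a rough sketch of the conversion. It assumes "1k" means 1024 complex points and uses the nominal 5·N·log2(N) flop count that FFT benchmarks commonly quote. Neither assumption is stated in the post, and the function names are only illustrative. The result is an aggregate rate over however many processors each run actually used.

```python
import math

def fft_flops(n):
    """Nominal flop count for one complex FFT of length n.

    Uses the 5 * n * log2(n) benchmarking convention (an assumption;
    the real operation count depends on the algorithm used).
    """
    return 5.0 * n * math.log2(n)

def achieved_gflops(num_ffts, n, seconds):
    """Aggregate GFLOPS for num_ffts transforms of length n in `seconds`."""
    return num_ffts * fft_flops(n) / seconds / 1e9

# Timings quoted in the post (seconds for 20000 transforms).
timings = {"FFTW": 1.0, "MKL": 0.75, "IPP": 0.5}

for name, seconds in timings.items():
    rate = achieved_gflops(20000, 1024, seconds)
    print(f"{name}: {rate:.2f} GFLOPS")
```

Under these assumptions the three libraries land at roughly 1, 1.4, and 2 GFLOPS, so the numbers are better compared against a realistic peak than against 4 or 8 GFLOPS.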
1 Reply
Even if you ignore memory access, the fastest possible issue rate for SSE or SSE2 parallel multiplication instructions is 1 per 2 clocks. Assuming that all the additions, memory accesses, and other operations could be fitted in without slowing this down, you would still need at least 6 clocks for 10 flops, which looks to me like a peak speed of about 3.3 GFLOPS at 2 GHz. That has to be extremely optimistic. Even with algorithms that allow approaching 2 flops per clock, I haven't seen more than 60% of that in sustained performance.
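The arithmetic behind those two figures can be sketched as follows. The inputs are the numbers quoted above (10 flops per 6 clocks, 2 flops per clock at 60% efficiency, 2 GHz per processor). The helper name is only illustrative.

```python
def gflops(flops_per_clock, clock_ghz):
    """Convert a flops-per-clock rate into GFLOPS at a given clock speed."""
    return flops_per_clock * clock_ghz

CLOCK_GHZ = 2.0  # per-processor clock from the original question

# Optimistic bound: 10 flops fitted into at least 6 clocks.
optimistic_peak = gflops(10 / 6, CLOCK_GHZ)

# Best sustained rate reported above: 60% of 2 flops per clock.
sustained_best = gflops(0.60 * 2, CLOCK_GHZ)

print(f"optimistic peak: {optimistic_peak:.2f} GFLOPS per processor")
print(f"best sustained:  {sustained_best:.2f} GFLOPS per processor")
```

This gives about 3.3 GFLOPS as the optimistic ceiling and about 2.4 GFLOPS as the best sustained rate per processor, both well under the 4 or 8 GFLOPS suggested in the question.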
The main advantage of MKL and IPP over compiler-generated code is that they organize more effective use of cache. Evidently, memory operations, which you and I have both tried to ignore, are just as important as the arithmetic.