In many CFD applications, a speedup of 3 on 4 threads would be considered excellent. If you actually had false sharing, your results would likely be much worse. As long as each thread stores data at least 2 cache lines away from data in use by other threads, you don't actually have false sharing. You could be in a less severe but analogous situation if a thread writes into, and thereby invalidates, a cache line which will soon be needed in another cache.
The most severe such situation, short of actual false sharing, would be where hardware prefetch is hindered by another thread writing into a cache line and invalidating it after it has been prefetched. Avoiding that might require threads to stay 4 or more cache lines apart, to make hardware prefetch an advantage rather than a hindrance.
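A minimal sketch of that spacing rule in C (the 64-byte line size, the array sizes, and the reduction itself are assumptions for illustration, not anyone's actual code):

```c
#include <omp.h>

#define CACHE_LINE 64                            /* assumed line size in bytes */
#define PAD (4 * CACHE_LINE / sizeof(double))    /* keep threads >= 4 lines apart */

/* Per-thread results live in padded slots, so no two threads ever write
   the same cache line, nor an adjacent line that prefetch may pull in. */
static double sums[8 * PAD];

void accumulate(const double *x, int n)
{
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        double local = 0.0;            /* accumulate in a register first */
        #pragma omp for
        for (int i = 0; i < n; i++)
            local += x[i];
        sums[tid * PAD] = local;       /* single write, far from other threads */
    }
}
```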
Are you using KMP_AFFINITY (for Intel OpenMP; for GNU OpenMP there is GOMP_CPU_AFFINITY)? If you don't use some method to keep your threads pinned to the right cores, you could easily lose 10% in performance. I think the affinity schemes are more effective on Linux than on Windows; if you can use a Linux release such as EL 4.4, CentOS 4.4, or newer, it may be worthwhile given the goals you have set.
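For what it's worth, here is a tiny Linux-only check that the pinning took effect; run it with, e.g., KMP_AFFINITY=compact for Intel OpenMP or GOMP_CPU_AFFINITY="0-7" for GNU (sched_getcpu is a glibc extension):

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Each OpenMP thread reports the core it is currently running on,
       so you can see whether the affinity setting was honored. */
    #pragma omp parallel
    printf("thread %d on cpu %d\n", omp_get_thread_num(), sched_getcpu());
    return 0;
}
```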
With VTune, you could collect bus events and cache misses to see whether those are implicated in limiting your scaling. Intel Thread Profiler might also shed light on it. If you have balanced the work among the threads and applied affinity, but still see time spent waiting for the laggard thread to complete, it could be that the slower threads are spending extra time moving data which is dirty in one cache when it is required in another. In the more intractable cases, OpenMP dynamic scheduling with a tuned chunk size may help overcome such delays, but you might have no chance of making your goal in such a case.
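A minimal sketch of the dynamic-scheduling suggestion (the Jacobi-style loop and the chunk size of 16 are placeholders to be tuned):

```c
/* schedule(dynamic, 16): idle threads grab the next 16 iterations, so a
   thread slowed by cache-to-cache traffic no longer holds everyone else
   at the implicit barrier at the end of the loop. */
void relax(double *unew, const double *u, const double *rhs, int n)
{
    #pragma omp parallel for schedule(dynamic, 16)
    for (int i = 1; i < n - 1; i++)
        unew[i] = 0.5 * (u[i - 1] + u[i + 1]) + rhs[i];
}
```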
I made new tests today on a machine with four dual-core Opterons. I set the affinity mask and indeed it improves performance a bit. Unfortunately there is no speedup any more when using more than 4 processors. As I'm doing DNS computations, scalability is critical: the grids are huge and so is the computational time. (I need at least a linear speedup ratio.)
That's why I want to use OpenMP in an "MPI-like" mode (n threads from the beginning to the end).
With VTune I sampled some L2 cache lock events. It appears that there are far more (100-1000 times as many) events in the multithreaded case. Those locked L2 accesses are always in the "exclusive" state.
I am not a specialist at all in computer architecture and I don't really know how to interpret this. Can you help me?
Thanks a lot!
PS: my code is really simple:
- a 128^3 grid;
- 128^2/n_thread FFTs in one direction (length = 128) for each thread.
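For reference, a minimal sketch of that decomposition with MKL's DFTI interface (the descriptor settings, data layout, and surrounding function are my assumptions, not the actual test code):

```c
#include <omp.h>
#include "mkl_dfti.h"

#define N 128   /* 128^3 grid; transform along the contiguous direction */

void fft_lines(MKL_Complex16 *grid)
{
    #pragma omp parallel
    {
        /* One private descriptor per thread: nothing is shared, so
           DFTI_NUMBER_OF_USER_THREADS is not needed. MKL's internal
           threading should be off (see Bruce's point below). */
        DFTI_DESCRIPTOR_HANDLE h;
        DftiCreateDescriptor(&h, DFTI_DOUBLE, DFTI_COMPLEX, 1, (MKL_LONG)N);
        DftiCommitDescriptor(h);

        int nt    = omp_get_num_threads();
        int tid   = omp_get_thread_num();
        int lines = (N * N) / nt;      /* 128^2 / n_thread lines per thread */

        for (int l = tid * lines; l < (tid + 1) * lines; l++)
            DftiComputeForward(h, grid + (size_t)l * N);

        DftiFreeDescriptor(&h);
    }
}
```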
Your use of the MKL FFT is fine. When you set up the calls to the FFT functions, what number of threads do you use? Since you are threading the code yourself, you don't want MKL to thread the FFTs as well (which, in general, would otherwise give you more performance), so the number of threads should be set to 1. If that is the case and you are still getting the locks, we would like to know about it.
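A minimal sketch of that setting, assumed to be done once at startup (setting the MKL_NUM_THREADS environment variable to 1 is an equivalent route):

```c
#include "mkl.h"

int main(void)
{
    /* The application supplies the parallelism, so MKL must not start
       its own thread team inside each FFT call. */
    mkl_set_num_threads(1);

    /* ... create the per-thread descriptors and run the FFTs ... */
    return 0;
}
```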
Bruce
Hi,
I think I have finally tried every possible method of threading my problem. Unfortunately, even in the simplest case, the speedup does not appear to be good enough: I only reach a speedup ratio of 5.5 on 8 processors with relatively big cases. That might not look bad, except when your simulation could run for a month on 32 processors.
I'm still seeing some locks in VTune, but I can't say whether the number of occurrences is significant or not.
I invite anyone who wants to help to look at the source code of my simple test, along with the script and the compilation options I was using, at the following link.
Thanks a lot!
In general, 5.5/8 on FFT is quite good. Memory reuse is problematic for the FFT, since there are O(n log n) computations on n data points. Reuse is thus on the order of log(n), which is quite small. Compare that to matrix multiply, where there are O(n^3) operations on O(n^2) data references (the amount of data moved into cache), for a reuse factor of n.
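As a rough illustration with the n = 128 from this thread (my arithmetic, not from the post):

```latex
\underbrace{\frac{O(n\log n)}{n}}_{\text{FFT reuse}} \sim \log_2 128 = 7
\qquad \text{vs.} \qquad
\underbrace{\frac{O(n^3)}{O(n^2)}}_{\text{matmul reuse}} \sim n = 128
```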
Bruce
Could it be due to the Opteron architecture?
I understand your environment, but my statement still holds. Since the number of operations is ~5*n*log(n), the reuse of data is small, which limits the amount of speedup you can get. Running individual FFTs on each thread eliminates the issues of cooperative parallelization of the code, but does nothing to improve the ratio of arithmetic operations per data point.
Bruce
So if I understand correctly, there are too many memory accesses in this problem, due to the small amount of data reuse, as you said.
The problem is then closed... Thanks anyway for helping me.
When reuse of data is large, how do multiple threads improve the ratio of arithmetic operations per data point? Do the threads somehow share the data in cache?
According to the MKL manual, FFTs can be used in 3 ways:
1. You do not create threads and let the FFT do it.
2. You create threads yourself but the number of threads using each FFT descriptor is one.
3. You want one FFT to be used by several user threads.
In the latter case you need to set DFTI_NUMBER_OF_USER_THREADS to the actual number of threads the FFT will be used with.
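For reference, a shared-descriptor setup (the manual's case 3) looks roughly like this in the DFTI C interface; the length of 128 and the count of 8 are assumed values:

```c
#include "mkl_dfti.h"

DFTI_DESCRIPTOR_HANDLE shared_fft;

void setup(void)
{
    DftiCreateDescriptor(&shared_fft, DFTI_DOUBLE, DFTI_COMPLEX, 1, (MKL_LONG)128);
    /* Promise MKL that up to 8 user threads will call compute on this
       one descriptor concurrently; must be set before the commit. */
    DftiSetValue(shared_fft, DFTI_NUMBER_OF_USER_THREADS, (MKL_LONG)8);
    DftiCommitDescriptor(shared_fft);
}

/* Each user thread may then call this on its own data concurrently. */
void transform(MKL_Complex16 *x)
{
    DftiComputeForward(shared_fft, x);
}
```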
However, my module is a plug-in to a 3rd-party host application, which creates the threads for me and calls my FFT function. I cannot know in advance how many threads this host app will run my FFT on. Trying to set this attribute to arbitrary numbers like 2 or 8 crashes MKL on an 8-core machine when the application quits.
If I set the number of user threads to 1, which is the default, I get extreme degradation in performance.
Any idea?
Thanks,
Itai
