My idea is simple. I need to compile the following program and directly get the results (intArray) when it is run.
[cpp]for (int x = 0; x < 10; x++)
    for (int y = 0; y < 100; y++)
        for (int z = 0; z < 10000; z++)
            intArray[pos] = (x + y + z);[/cpp]
In my code, pos (i.e. the position in intArray at which to store the value x+y+z) is calculated first, so don't worry about that. I want to parallelize the code using tbb::task (that portion is omitted, as it's very similar to the treeSum example for TBB). Basically, the position in intArray where each thread stores its value is independent of the others. However, to ensure thread safety I use std::vector<tbb::atomic<int>> as the data structure.
My question is that the performance speedup of the parallel version over the serial version is not as expected, and I don't know why. As far as I know, atomics are very fast (when using std::vector<tbb::atomic<int>> I got almost twice the speed compared with tbb::concurrent_vector), so the speedup should be better, or at least the same. But the speedup is only around 1.1X for 2 threads (btw, when using tbb::concurrent_vector, the parallel version is almost 1.8X faster than the serial version).
Can anyone help me? Thank you very much.
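(For reference, a minimal sketch of what such a parallel version might look like; the flattened pos formula and the use of tbb::parallel_for in place of the omitted tbb::task code are assumptions for illustration, not the original code:)
[cpp]#include <vector>
#include "tbb/atomic.h"
#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"

static const int NX = 10, NY = 100, NZ = 10000;

struct FillBody
{
    std::vector<tbb::atomic<int> >* intArray;
    void operator()(const tbb::blocked_range<int>& r) const
    {
        for (int x = r.begin(); x != r.end(); ++x)
            for (int y = 0; y < NY; ++y)
                for (int z = 0; z < NZ; ++z)
                {
                    int pos = (x * NY + y) * NZ + z; // hypothetical pos formula
                    (*intArray)[pos] = x + y + z;    // atomic store per element
                }
    }
};

void fill(std::vector<tbb::atomic<int> >& intArray)
{
    FillBody body = { &intArray };
    tbb::parallel_for(tbb::blocked_range<int>(0, NX), body);
}[/cpp]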
Quoting - yanling_zhi
My idea is simple. I need to compile the following program and directly get the results (intArray) when it is run.
[cpp]for (int x = 0; x < 10; x++)
    for (int y = 0; y < 100; y++)
        for (int z = 0; z < 10000; z++)
            intArray[pos] = (x + y + z);[/cpp]
In my code, pos (i.e. the position in intArray at which to store the value x+y+z) is calculated first, so don't worry about that. I want to parallelize the code using tbb::task (that portion is omitted, as it's very similar to the treeSum example for TBB). Basically, the position in intArray where each thread stores its value is independent of the others. However, to ensure thread safety I use std::vector<tbb::atomic<int>> as the data structure.
My question is that the performance speedup of the parallel version over the serial version is not as expected, and I don't know why. As far as I know, atomics are very fast (when using std::vector<tbb::atomic<int>> I got almost twice the speed compared with tbb::concurrent_vector), so the speedup should be better, or at least the same. But the speedup is only around 1.1X for 2 threads (btw, when using tbb::concurrent_vector, the parallel version is almost 1.8X faster than the serial version).
Can anyone help me? Thank you very much.
If "the position in intArray for each thread to store the value is independent from each other", then you do not need any additional protection. Use std::vector
You need to "ensure thread safety" iff threads make conflicting accesses to some data.
Quoting - Dmitriy Vyukov
If "the position in intArray for each thread to store the value is independent from each other", then you do not need any additional protection. Use std::vector
You need to "ensure thread safety" iff threads make conflicting accesses to some data.
Uh, actually I'm doing a test before parallelizing a large project. What can be guaranteed is that there are few conflicts, but not none, so I need to protect the data. During my test I found that, for the serial version, using std::vector<tbb::atomic<int>> loses little performance compared with using std::vector<int>. Hence whether atomics are used or not is not the key.
What's more, I just modified the test code as you suggested, using std::vector<int> to replace std::vector<tbb::atomic<int>>, yet the speedup is still 1.1X. This is really strange.
Now I'm thinking that the way I parallelized it is not very good.
A concurrent_vector does not protect against races on any given element (that is still your responsibility), so concurrent_vector is not a proper substitute for std::vector<tbb::atomic<int>> in this case; also, there is some overhead for the privilege of growing the collection while references to elements remain valid (is this the "almost twice speedup" that you mention?), so if its size doesn't change during processing, this is not the type you should normally use. On x86, reading from and writing to an atomic typically carries no penalty compared to the same operations on the non-atomic type.
I don't see why a vector would not allow you to scale when a concurrent_vector does (the concurrent_vector result seems to rule out sharing issues). How about a plain dynamically allocated array, to rule out any quirks in the std::vector implementation? While you're at it, also try a version that accesses only every 16th or so element of the array, to really rule out sharing issues. If that doesn't help, there must be an issue related to the algorithm (unless I overlooked something?).
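(For illustration, a minimal sketch of such a stride test, assuming a 64-byte cache line; this is not code from the thread. If writing only every 16th int scales well while the dense version does not, false sharing is the likely culprit.)
[cpp]#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"

const int N = 10 * 100 * 10000;  // same element count as intArray
const int STRIDE = 16;           // 16 ints * 4 bytes = one 64-byte cache line

struct StrideBody
{
    int* a;
    void operator()(const tbb::blocked_range<int>& r) const
    {
        for (int i = r.begin(); i != r.end(); ++i)
            a[i * STRIDE] = i;   // each write lands in a distinct cache line
    }
};

void stride_test(int* a) // a must hold at least N ints
{
    StrideBody body = { a };
    tbb::parallel_for(tbb::blocked_range<int>(0, N / STRIDE), body);
}[/cpp]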
Quoting - yanling_zhi
Uh, actually I'm doing a test before parallelizing a large project. What can be guaranteed is that there are few conflicts, but not none, so I need to protect the data. During my test I found that, for the serial version, using std::vector<tbb::atomic<int>> loses little performance compared with using std::vector<int>. Hence whether atomics are used or not is not the key.
What's more, I just modified the test code as you suggested, using std::vector<int> to replace std::vector<tbb::atomic<int>>, yet the speedup is still 1.1X. This is really strange.
Now I'm thinking that the way I parallelized it is not very good.
Ok, I see. Then the problem is most likely false sharing.
If two variables are situated in the same cache line, which is most likely the case here:
[cpp]int data[2];[/cpp]
then if thread 1 writes to data[0] and thread 2 writes to data[1], this incurs tremendous overheads (hundreds of cycles).
The processor manages data on a cache-line basis rather than at the level of high-level-language variables. So you have to ensure that data accessed by different threads is situated in different cache lines (the cache line size is 64 bytes on modern Intel hardware).
So, figuratively, instead of:
[cpp]struct data_t
{
    int data_for_thread_1;
    int data_for_thread_2;
    int data_for_thread_3;
    int data_for_thread_4;
};
data_t data[100000];[/cpp]
you better have:
[cpp]int data_for_thread_1[100000];
int data_for_thread_2[100000];
int data_for_thread_3[100000];
int data_for_thread_4[100000];[/cpp]
This will ensure that different threads do not work on consecutive elements (corner effects aside for a moment).
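(Another common fix, when the per-thread data must stay interleaved, is to pad each thread's slot out to a full cache line. A minimal sketch of that layout, assuming the 64-byte line size mentioned above; this is an illustration, not code from the thread:)
[cpp]// Hypothetical padded layout: each thread's slot owns a whole cache line,
// so writes by different threads never touch the same line
// (assuming the array itself starts on a line boundary).
struct padded_slot
{
    int value;
    char pad[64 - sizeof(int)];
};
padded_slot data[4]; // one slot per thread[/cpp]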
Quoting - Dmitriy Vyukov
Ok, I see. Then the problem is most likely false sharing.
The other possible problem is memory bus saturation. Cores share memory bandwidth, so if one thread saturates the memory bus, you can't get a speedup from two threads.
You may try to calculate the total bandwidth that your program consumes. Note that even if you only write to a memory location, the processor has to load the cache line first. Then compare the obtained value with the hardware's declared bandwidth.
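(As a rough worked example of that calculation, using the loop bounds from the original post; the bandwidth figure is an assumption, not a measurement from the thread: the loops perform 10 * 100 * 10000 = 10,000,000 int stores, i.e. 40 MB written per pass. Since each store first pulls its cache line in, the bus moves roughly 80 MB per pass; dividing 80 MB by the measured run time gives the achieved bandwidth, to be compared against the platform's declared figure, typically a few GB/s on hardware of that era.)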
Quoting - Dmitriy Vyukov
The other possible problem is memory bus saturation. Cores share memory bandwidth, so if one thread saturates the memory bus, you can't get a speedup from two threads.
You may try to calculate the total bandwidth that your program consumes. Note that even if you only write to a memory location, the processor has to load the cache line first. Then compare the obtained value with the hardware's declared bandwidth.
Maybe non-temporal store SSE instructions can help in such a situation. They do not load the data into cache before the write.
Quoting - Dmitriy Vyukov
Ok, I see. Then the problem is most likely false sharing.
Quoting - Dmitriy Vyukov
The other possible problem is memory bus saturation. Cores share memory bandwidth, so if one thread saturates the memory bus, you can't get a speedup from two threads.
You may try to calculate the total bandwidth that your program consumes. Note that even if you only write to a memory location, the processor has to load the cache line first. Then compare the obtained value with the hardware's declared bandwidth.
Thank you Dmitriy. Your suggestions are very useful. As to the former possible problem, i.e. false sharing, I have tried to modify the code using vector<tbb::atomic<int>>.
Quoting - Raf Schietekat
A concurrent_vector does not protect against races on any given element (that is still your responsibility), so concurrent_vector is not a proper substitute for std::vector<tbb::atomic<int>> in this case; also, there is some overhead for the privilege of growing the collection while references to elements remain valid (is this the "almost twice speedup" that you mention?), so if its size doesn't change during processing, this is not the type you should normally use. On x86, reading from and writing to an atomic typically carries no penalty compared to the same operations on the non-atomic type.
I don't see why a vector would not allow you to scale when a concurrent_vector does (the concurrent_vector result seems to rule out sharing issues). How about a plain dynamically allocated array, to rule out any quirks in the std::vector implementation? While you're at it, also try a version that accesses only every 16th or so element of the array, to really rule out sharing issues. If that doesn't help, there must be an issue related to the algorithm (unless I overlooked something?).
Actually, when using concurrent_vector<int>, both the serial and the parallel versions are slower than when using std::vector<tbb::atomic<int>>.
My point is, when using the same data structure (concurrent_vector<int>) for both the serial and the parallel versions, the speedup is almost 1.8X, while with std::vector<tbb::atomic<int>> it is only around 1.1X.
And Raf Schietekat told me that concurrent_vector doesn't protect against data races. I checked the source code of concurrent_vector and the documents: operator[] is only thread-safe for reading, not for writing. So again I will not consider concurrent_vector as a choice.
Quoting - yanling_zhi
Thank you Dmitriy. Your suggestions are very useful. As to the former possible problem, i.e. false sharing, I have tried to modify the code using vector<tbb::atomic<int>>.
Quoting - Dmitriy Vyukov
Maybe non-temporal store SSE instructions can help in such a situation. They do not load the data into cache before the write.
Quoting - Dmitriy Vyukov
The other possible problem is memory bus saturation. Cores share memory bandwidth, so if one thread saturates the memory bus, you can't get a speedup from two threads.
You may try to calculate the total bandwidth that your program consumes. Note that even if you only write to a memory location, the processor has to load the cache line first. Then compare the obtained value with the hardware's declared bandwidth.
According to my code, the size of the vector is 10 * 100 * 10000 = 10,000,000 ints, i.e. about 40 MB.
Another question: will the compiler optimize the write instructions to bypass the cache on its own? Now I'm trying to bypass the cache.
Ok, I see what the problem is.
First of all, in your code no two threads store to the same location, so I used just std::vector<int>.
The speedup on my quad-core machine is 1.77.
When I changed the computation from x = v1 + v2 to x = (int)((v1 / (v2 * 3.0 + 1.0)) / (v1 * 2.0 + 3.0)), i.e. made it more complex (note the division of floating-point values), I obtained a speedup of 3.92. This suggests that the problem is memory bus saturation.
There is not much you can do about this.
You may try to use a smaller data type, i.e. short instead of int, if your problem allows.
Or, if you have one memory-intensive phase and one computation-intensive phase, you may try to combine them in order to "dilute" the memory operations.
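(A minimal sketch of that "dilute" idea; the function and names are an illustration, not code from the thread: instead of one pass that only stores and another that only computes, fuse them so the memory traffic overlaps with useful arithmetic.)
[cpp]#include <cmath>

// Hypothetical fused pass: the memory-bound stores are interleaved with
// compute-bound work instead of running as two separate phases.
void fused(const int* in, int* out, int n)
{
    for (int i = 0; i < n; ++i)
    {
        double t = std::sqrt((double)in[i] + 1.0); // computation-intensive part
        out[i] = (int)t;                           // memory-intensive part
    }
}[/cpp]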
I've applied non-temporal store instructions, and they improve the situation somewhat. You may try to use the _mm_stream_si32() intrinsic:
[cpp]#include <emmintrin.h>

__forceinline void compute(int& d, int v1, int v2)
{
    //int v = (int)((v1 / (v2 * 3.0 + 1.0)) / (v1 * 2.0 + 3.0));
    int v = v1 + v2;
    _mm_stream_si32(&d, v);
    //d = v;
}[/cpp]
With non-temporal stores I obtained a 2.28 speedup, instead of 1.77 with plain stores.
Quoting - Dmitriy Vyukov
Ok, I see what the problem is.
First of all, in your code no two threads store to the same location, so I used just std::vector<int>.
The speedup on my quad-core machine is 1.77.
When I changed the computation from x = v1 + v2 to x = (int)((v1 / (v2 * 3.0 + 1.0)) / (v1 * 2.0 + 3.0)), i.e. made it more complex (note the division of floating-point values), I obtained a speedup of 3.92. This suggests that the problem is memory bus saturation.
There is not much you can do about this.
You may try to use a smaller data type, i.e. short instead of int, if your problem allows.
Or, if you have one memory-intensive phase and one computation-intensive phase, you may try to combine them in order to "dilute" the memory operations.