Beginner

Why didn't I see a better speedup when using atomic?

My idea is simple: I need to compile the following program and have it directly give the results (intArray) when run.
for(int x = 0; x < 10; x++)
for(int y = 0; y < 100; y++)
for(int z = 0; z < 10000; z++)
intArray[pos] = (x + y + z);
In my code, pos (i.e. the position in intArray at which to store the value x+y+z) will be calculated first, so don't worry about this. I want to parallelize my code using tbb::task (this portion is omitted, as it's much like the treeSum example for TBB). Basically, the position in intArray at which each thread stores its value is independent of the others. However, to ensure thread safety I use std::vector<tbb::atomic<int>> as the data structure.
My question is that the speedup of the parallel version over the serial version is not as expected, and I don't know why. As far as I know, atomic is very fast (when using this std::vector<tbb::atomic<int>>, I got almost twice the speedup compared with tbb::concurrent_vector), and should give a better speedup, or at least the same. But the speedup is only around 1.1X for 2 threads (by the way, when using tbb::concurrent_vector, the parallel version is almost 1.8X faster than the serial version).
Can anyone help me? Thank you very much.
15 Replies
Valued Contributor I

Quoting - yanling_zhi
My idea is simple: I need to compile the following program and have it directly give the results (intArray) when run.
for(int x = 0; x < 10; x++)
for(int y = 0; y < 100; y++)
for(int z = 0; z < 10000; z++)
intArray[pos] = (x + y + z);
In my code, pos (i.e. the position in intArray at which to store the value x+y+z) will be calculated first, so don't worry about this. I want to parallelize my code using tbb::task (this portion is omitted, as it's much like the treeSum example for TBB). Basically, the position in intArray at which each thread stores its value is independent of the others. However, to ensure thread safety I use std::vector<tbb::atomic<int>> as the data structure.
My question is that the speedup of the parallel version over the serial version is not as expected, and I don't know why. As far as I know, atomic is very fast (when using this std::vector<tbb::atomic<int>>, I got almost twice the speedup compared with tbb::concurrent_vector), and should give a better speedup, or at least the same. But the speedup is only around 1.1X for 2 threads (by the way, when using tbb::concurrent_vector, the parallel version is almost 1.8X faster than the serial version).
Can anyone help me? Thank you very much.

If "the position in intArray for each thread to store the value is independent from each other", then you do not need any additional protection. Use std::vector.
You need to "ensure thread safety" iff threads make conflicting accesses to some data.


Beginner

Quoting - Dmitriy Vyukov

If "the position in intArray for each thread to store the value is independent from each other", then you do not need any additional protection. Use std::vector.
You need to "ensure thread safety" iff threads make conflicting accesses to some data.



Uh, actually I'm doing a test before parallelizing a large project. What can be guaranteed is that there are few conflicts, but not none. So I need to protect the data. During my test I found that for the serial version, using std::vector<tbb::atomic<int>> loses little performance compared with using std::vector<int>. Hence whether atomic is used or not is not the key.
What's more, I just modified the test code as you suggested, using std::vector<int> to replace std::vector<tbb::atomic<int>>, yet the speedup is still 1.1X. This is really strange.
Now I'm thinking that the way of parallelization is not very good.
Black Belt

A concurrent_vector does not protect against races on any given element (this is still your responsibility), so concurrent_vector is not a proper substitute for vector<tbb::atomic<int>> in this case; also, there will be some overhead for the privilege of changing the collection while references to elements remain valid (is this the "almost twice speedup" that you mention?), so if its size doesn't change during processing, this is not the type you should normally use. On x86, reading from and writing to an atomic<int> typically carries no penalty compared to the same operations on the non-atomic type.

I don't see why a vector would not allow you to scale when a concurrent_vector does (the concurrent_vector example seems to rule out sharing issues). How about a plain dynamically allocated array, to rule out any quirks in the std::vector implementation? While you're at it, also try a version that accesses only every 16th or so element of the array, to really rule out sharing issues. If that doesn't help, there must be an issue in the algorithm (unless I overlooked something?).
Valued Contributor I

Quoting - yanling_zhi
Uh, actually I'm doing a test before parallelizing a large project. What can be guaranteed is that there are few conflicts, but not none. So I need to protect the data. During my test I found that for the serial version, using std::vector<tbb::atomic<int>> loses little performance compared with using std::vector<int>. Hence whether atomic is used or not is not the key.
What's more, I just modified the test code as you suggested, using std::vector<int> to replace std::vector<tbb::atomic<int>>, yet the speedup is still 1.1X. This is really strange.
Now I'm thinking that the way of parallelization is not very good.

OK, I see. Then the problem is most likely false sharing.
If two variables are situated in the same cache line, which is most likely the case here:
int data [2];
then if thread 1 writes to data[0] and thread 2 writes to data[1], this incurs tremendous overhead (hundreds of cycles).

The processor manages data on a cache-line basis rather than at the level of high-level-language variables. So you have to ensure that data accessed by different threads is situated in different cache lines (the cache line size is 64 bytes on modern Intel hardware).

So, figuratively, instead of:

struct data_t
{
int data_for_thread_1;
int data_for_thread_2;
int data_for_thread_3;
int data_for_thread_4;
};

data_t data [100000];

you better have:

int data_for_thread_1 [100000];
int data_for_thread_2 [100000];
int data_for_thread_3 [100000];
int data_for_thread_4 [100000];

This will ensure that different threads do not work on consecutive elements (corner effects aside for a moment).

Valued Contributor I

Quoting - Dmitriy Vyukov
OK, I see. Then the problem is most likely false sharing.


The other possible problem is memory bus saturation. Cores share memory bandwidth, so if one thread saturates the memory bus, then you can't get a speedup from two threads.
You may try to calculate the total bandwidth that your program consumes. Note that even if you only write to a memory location, the processor has to load the cache line first. Then compare the obtained value with the hardware's declared bandwidth.

Valued Contributor I

Quoting - Dmitriy Vyukov

The other possible problem is memory bus saturation. Cores share memory bandwidth, so if one thread saturates the memory bus, then you can't get a speedup from two threads.
You may try to calculate the total bandwidth that your program consumes. Note that even if you only write to a memory location, the processor has to load the cache line first. Then compare the obtained value with the hardware's declared bandwidth.


Maybe non-temporal SSE store instructions can help in such a situation. They do not load data into the cache before writing.

Black Belt

Quoting - Dmitriy Vyukov
OK, I see. Then the problem is most likely false sharing.
I don't see: why then isn't the version using concurrent_vector similarly afflicted?

Quoting - Dmitriy Vyukov

The other possible problem is memory bus saturation. Cores share memory bandwidth, so if one thread saturates the memory bus, then you can't get a speedup from two threads.
You may try to calculate the total bandwidth that your program consumes. Note that even if you only write to a memory location, the processor has to load the cache line first. Then compare the obtained value with the hardware's declared bandwidth.


Thank you, Dmitriy. Your suggestions are very useful. As to the former possible problem, i.e. false sharing, I have tried modifying the code to use vector<tbb::atomic<int>, cache_aligned_allocator<tbb::atomic<int>>> instead of vector<tbb::atomic<int>>. However, the performance still shows little improvement. And the second problem, memory bus saturation, seems reasonable, as in my program every thread is writing to memory simultaneously while the calculation time is very short (x+y+z). Unfortunately I don't know how to solve this problem.

Quoting - Raf Schietekat
A concurrent_vector does not protect against races on any given element (this is still your responsibility), so concurrent_vector is not a proper substitute for vector<tbb::atomic<int>> in this case; also, there will be some overhead for the privilege of changing the collection while references to elements remain valid (is this the "almost twice speedup" that you mention?), so if its size doesn't change during processing, this is not the type you should normally use. On x86, reading from and writing to an atomic<int> typically carries no penalty compared to the same operations on the non-atomic type.

I don't see why a vector would not allow you to scale when a concurrent_vector does (the concurrent_vector example seems to rule out sharing issues). How about a plain dynamically allocated array, to rule out any quirks in the std::vector implementation? While you're at it, also try a version that accesses only every 16th or so element of the array, to really rule out sharing issues. If that doesn't help, there must be an issue in the algorithm (unless I overlooked something?).

Actually, when using concurrent_vector, I used it as the data structure to store the results for the serial version too. This is incorrect, because the serial version doesn't need it, and it may slow things down greatly. So I changed it later, that is, vector<int> for the serial version and concurrent_vector for the parallel version. Now the speedup is around 0.70X, meaning no speedup at all. So I gave up on the method of using concurrent_vector.
My point is, when using the same data structure (concurrent_vector, or vector<tbb::atomic<int>>) for both the serial and parallel versions, I got very different speedups (1.8 for concurrent_vector, and only 1.1 for vector<tbb::atomic<int>>). This at first seemed weird. Well, it may come from the difference between the data structures.
And Raf Schietekat told me that concurrent_vector doesn't protect against data races. I checked the source code of concurrent_vector and the documentation: operator[] is only thread-safe for reading, not for writing. So again I will not consider concurrent_vector as a choice.


For the convenience of discussion, the latest code is uploaded. See the attachments.


Quoting - Dmitriy Vyukov

Maybe non-temporal SSE store instructions can help in such a situation. They do not load data into the cache before writing.

This is very interesting. But how can I use these SSE instructions in a C++ program? By inserting assembly language?

Quoting - Dmitriy Vyukov

The other possible problem is memory bus saturation. Cores share memory bandwidth, so if one thread saturates the memory bus, then you can't get a speedup from two threads.
You may try to calculate the total bandwidth that your program consumes. Note that even if you only write to a memory location, the processor has to load the cache line first. Then compare the obtained value with the hardware's declared bandwidth.


According to my code, the size of the vector<tbb::atomic<int>> is about 16M elements; for the int type, only 64MB needs to be written to memory (32-bit OS).
Another question: will the compiler optimize the store instructions to bypass the cache? Now I'm trying to bypass the cache.
Valued Contributor I

OK, I see what the problem is.

First of all, in your code no two threads store to the same location, so I used just std::vector.

Speedup on my quad-core machine is 1.77.

When I changed the computation from x = v1 + v2 to x = (int)((v1 / (v2 * 3.0 + 1.0)) / (v1 * 2.0 + 3.0)), i.e. made it more complex (note the division of floating-point values), I obtained a speedup of 3.92. This suggests that the problem is memory bus saturation.

There is not much you can do about this.
You may try to use a smaller data type, i.e. short instead of int, if your problem allows.
Or, if you have one memory-intensive phase and one computation-intensive phase, you may try to combine them in order to "dilute" the memory operations.




Valued Contributor I

I've applied non-temporal store instructions, and they improve the situation somewhat. You may try the _mm_stream_si32() intrinsic:

[cpp]#include <emmintrin.h>

__forceinline
void compute(int& d, int v1, int v2)
{
    //int v = (int)((v1 / (v2 * 3.0 + 1.0)) / (v1 * 2.0 + 3.0));
    int v = v1 + v2;
    
    _mm_stream_si32(&d, v);
    //d = v;
}
[/cpp]

With non-temporal stores I obtained a 2.28X speedup, instead of 1.77X with plain stores.


Quoting - Dmitriy Vyukov
OK, I see what the problem is.

First of all, in your code no two threads store to the same location, so I used just std::vector.

Speedup on my quad-core machine is 1.77.

When I changed the computation from x = v1 + v2 to x = (int)((v1 / (v2 * 3.0 + 1.0)) / (v1 * 2.0 + 3.0)), i.e. made it more complex (note the division of floating-point values), I obtained a speedup of 3.92. This suggests that the problem is memory bus saturation.

There is not much you can do about this.
You may try to use a smaller data type, i.e. short instead of int, if your problem allows.
Or, if you have one memory-intensive phase and one computation-intensive phase, you may try to combine them in order to "dilute" the memory operations.




Thank you very much. I tested the code as you did, and got similar results. I'm thinking about how to make the algorithm not write to memory so often.